DETERMINING PROTEIN FUNCTION AND INTERACTION FROM GENOME ANALYSIS

STATEMENT AS TO FEDERALLY SPONSORED RESEARCH 

The U.S. Government has certain rights in this invention pursuant to Grant Nos.
DE-FC03-87ER60615 awarded by the Department of Energy and GM31299 awarded by the National
Institutes of Health.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority from Provisional Application Serial No. 60/117,844,
filed January 29, 1999, Provisional Application Serial No. 60/118,206, filed February 1,
1999, Provisional Application Serial No. 60/126,593, filed March 26, 1999, Provisional
Application Serial No. 60/134,093, filed May 14, 1999, and Provisional Application Serial
No. 60/134,092, filed May 14, 1999, to which applications priority claim is made under 35
U.S.C. §119(e), the disclosures of which are incorporated herein by reference. The present
application also incorporates by reference USSN __/___,___, for "A Rosetta Stone Method
For Detecting Protein Function and Protein-Protein Interactions From Genome Sequences"
(attorney docket No.: 07419-020001) and USSN __/___,___, for "Assigning Protein
Functions By Comparative Genome Analysis: Protein Phylogenetic Profiles" (attorney
docket No.: 07419-021001), filed concurrently on January 28, 2000. Each of the
aforementioned applications is explicitly incorporated herein by reference in its entirety
and for all purposes.

FIELD OF THE INVENTION 

The present invention relates to methods and systems for predicting the function of proteins.
In particular, the invention relates to materials, software, automated systems, and methods for
implementing the same in order to predict the function(s) of a protein.

BACKGROUND OF THE INVENTION 

A central tenet of modern biology is that genetic information resides in a nucleic acid
genome, and that the information embodied in such a genome (i.e., the genotype) directs cell
function. This occurs through the expression of various genes in the genome of an organism
and regulation of the expression of such genes. The expression of genes in a cell or organism
defines the cell or organism's physical characteristics (i.e., its phenotype). This is
accomplished through the translation of genes into proteins.

Proteins (or polypeptides) are linear polymers of amino acids. The polymerization reaction,
which produces a protein, results in the loss of one molecule of water from each amino acid,
and hence proteins are often said to be composed of amino acid "residues." Natural protein
molecules may contain as many as 20 different types of amino acid residues, each of which
contains a distinctive side chain. The particular linear sequence of amino acid residues in a
protein defines the primary sequence, or primary structure, of the protein. The primary
structure of a protein can be determined with relative ease using known methods.

In order to more fully understand and determine potential therapeutics, antibiotics and
biologics for various organisms, efforts have been undertaken to sequence the genomes of a
number of organisms. For example, the Human Genome Project began with the specific goal
of obtaining the complete sequence of the human genome and determining the biochemical
function(s) of each gene. To date, the project has resulted in sequencing a substantial portion
of the human genome (J. Roach,
http://weber.u.washington.edu/~roach/human_genome_progress2.html) (Gibbs, 1995). At least
twenty-one other genomes have already been sequenced, including, for example,
M. genitalium (Fraser et al., 1995), M. jannaschii (Bult et al., 1996), H. influenzae
(Fleischmann et al., 1995), E. coli (Blattner et al., 1997), and yeast (S. cerevisiae)
(Mewes et al., 1997). Significant progress has also been made in sequencing the genomes of
model organisms, such as mouse, C. elegans, Arabidopsis sp. and D. melanogaster. Several
databases containing genomic information annotated with some functional information are
maintained by different organizations, and are accessible via the internet, for example,
http://www.tigr.org/tdb; http://www.genetics.wisc.edu; http://genome-www.stanford.edu/~ball;
http://hiv-web.lanl.gov; http://www.ncbi.nlm.nih.gov; http://www.ebi.ac.uk;
http://Pasteur.fr/other/biology; and http://www.genome.wi.mit.edu. The raw nucleic acid
sequences in a genome can be converted by one of a number of available algorithms to the
amino acid sequences of proteins, which carry out the vast array of processes in a cell.
Unfortunately, these raw protein sequence data do not immediately describe how the proteins
function in the cell. Understanding the details of various cellular processes (e.g.,
metabolic pathways, signaling between molecules, cell division, etc.), and which proteins
carry out which processes, is a central goal in modern cell biology.

Throughout evolution, the protein sequences in different organisms have been conserved to
varying degrees. As a result, any given organism contains many proteins that are
recognizably similar to proteins in other organisms. Such similar proteins, having arisen
from the same ancestral protein, are called homologs.

To a degree, homology between proteins is useful in assigning biological functions to new
protein sequences. The most direct approach for assigning functions to proteins is by
laborious laboratory experimentation. However, if a particular uncharacterized protein
sequence is homologous to one that has already been studied experimentally, often the
function of the former can be equated to the function of the latter.

Unfortunately, the ability to assign functions to proteins by homology is limited. Many
protein sequences do not have experimentally characterized homologs in other organisms.
Depending on the organism, between one-third and one-half of the proteins in a genome
cannot be assigned functions by homology or other available computational methods.
Accordingly, new methods for predicting the functions of proteins from genome sequences
are needed.

SUMMARY OF THE INVENTION 

Determining protein functions from genomic sequences is a central goal of bioinformatics.
Genomic sequences do not contain explicit information on the function of the proteins that
they encode, yet this information is critical in medical and agricultural biotechnology. The
invention provides materials, software, automated systems, and methods that are useful for
predicting protein function. Such information is useful, for example, for identifying new
genes and identifying potential targets for pharmaceutical compounds.

In one embodiment, the invention provides a method to predict functional links (e.g.,
associations between proteins) based on the concept that proteins that function together in a
pathway or structural complex can often be found in another organism fused together into a
single protein. By identifying these patterns of relationship or gene fusion one can predict
the interactions between unknown proteins based on the similar sequence information found
in other related proteins (i.e., either functionally related or physically related). Through
sequence comparison, one can identify a fused protein, termed herein the "Rosetta Stone"
protein, which is similar over different regions to two distinct proteins that are not similar to
each other. This establishes a functional link between two otherwise unrelated proteins. The
inventors have discovered that proteins that can be associated together via the Rosetta Stone
protein tend strongly to be functionally linked.

In another embodiment, the invention provides a computational method that detects proteins
that participate in a common structural complex or metabolic pathway. Proteins within these
groups are defined as "functionally-linked." Functionally-linked proteins evolve in a
correlated fashion, and therefore they have homologs in the same subset of organisms. For
instance, it is expected that flagellar proteins will be found in bacteria that possess flagella
but not in other organisms. Simply put, if two proteins have homologs in the same subset of
fully (or nearly fully) sequenced organisms but are absent in other organisms, they are likely
to be functionally-linked. The present invention provides a method wherein this property is
used to systematically map functional interactions between all the proteins coded by a
genome. This method overcomes the problem that pairs of functionally linked proteins
in general have no amino acid sequence similarity with each other and therefore cannot be
linked by conventional sequence alignment techniques.
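The presence/absence comparison described above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the protein names, the four-genome profiles, and the helper names (`group_by_profile`, `hamming`, `linked_pairs`) are hypothetical, and each profile is assumed to have been computed already by a homology search against each genome.

```python
from collections import defaultdict
from itertools import combinations


def group_by_profile(profiles):
    """Map each distinct presence/absence profile to the proteins sharing it."""
    groups = defaultdict(list)
    for protein, bits in profiles.items():
        groups[tuple(bits)].append(protein)
    return dict(groups)


def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_diff=0):
    """Pairs of proteins whose profiles differ in at most max_diff genomes."""
    return sorted(
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if hamming(profiles[p], profiles[q]) <= max_diff
    )


# Hypothetical toy data: one bit per genome (1 = a homolog is present).
profiles = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),
    "P3": (0, 1, 0, 0),
    "P4": (1, 0, 1, 0),
}

print(linked_pairs(profiles))              # identical profiles only
print(linked_pairs(profiles, max_diff=1))  # also allow a one-bit difference
```

In this toy data, P1 and P2 share a profile and would be predicted to be functionally linked even though nothing about their sequences is compared to each other.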

One embodiment provides a method of identifying multiple polypeptides as
functionally-linked, the method including aligning a primary amino acid sequence of multiple
distinct non-homologous polypeptides to the primary amino acid sequences of a plurality of
proteins; and for any alignment found between the primary amino acid sequences of all of such
multiple distinct non-homologous polypeptides and the primary amino acid sequence of at
least one such protein, outputting an indication identifying the at least one such protein as an
indication of a functional link between the multiple polypeptides.

In another embodiment, a computer program is provided for identifying a protein as
functionally linked, the computer program comprising instructions for causing a computer
system to align a primary amino acid sequence of multiple distinct non-homologous
polypeptides to the primary amino acid sequences of a plurality of proteins; and for any
alignment found between the primary amino acid sequences of all polypeptides and the
primary amino acid sequence of at least one such protein, output an indication of an
identity of such protein.

In yet another embodiment, the invention provides a method of identifying a plurality of
polypeptides as having a functional link, the method including aligning a primary amino acid
sequence of a protein to the primary amino acid sequences of each of a plurality of distinct
non-homologous polypeptides; and for any alignment found between the primary amino acid
sequence of the protein and the primary amino acid sequence of the plurality of distinct
non-homologous polypeptides, wherein the primary amino acid sequence of the protein contains
an amino acid sequence similar to at least two distinct non-homologous polypeptides,
outputting an indication identifying any distinct non-homologous polypeptides as
functionally-linked.

In another embodiment the invention provides a computer program, stored on a
computer-readable medium, for identifying a plurality of polypeptides as having a functional
link, the computer program comprising instructions for causing a computer system to align a
primary amino acid sequence of a protein to the primary amino acid sequences of each of a
plurality of distinct non-homologous polypeptides; and for any alignment found between the
primary amino acid sequence of the protein and the primary amino acid sequence of the
plurality of distinct non-homologous polypeptides, wherein the primary amino acid sequence of
the protein contains an amino acid sequence from at least two distinct non-homologous
polypeptides, output an indication identifying any distinct non-homologous polypeptides as
functionally-linked.

In yet another embodiment, the invention provides a method for identifying multiple proteins
as having a functional link, comprising obtaining data, comprising a list of proteins from at
least two genomes; comparing the list of proteins to form a protein phylogenetic profile for
each protein or protein family, wherein the protein phylogenetic profile indicates the
presence or absence of a protein belonging to a particular protein family in each of the at
least two genomes based on homology of the proteins; and grouping the list of proteins based
on similar profiles, wherein proteins with similar profiles are indicated to be functionally
linked.

In yet still another embodiment, the invention provides a computer program, stored on a
computer-readable medium, for identifying multiple polypeptides as having a functional link,
the computer program comprising instructions for causing a computer system to obtain data,
comprising a list of proteins from at least two genomes; compare the data to form a protein
phylogenetic profile for each protein or protein family, wherein the protein phylogenetic
profile indicates the presence or absence of a protein belonging to a particular protein family
in each of the at least two genomes based on homology of the proteins; and group the list of
proteins based on similar profiles, wherein proteins with similar profiles are indicated to be
functionally linked.

In yet another embodiment, the invention provides a method for determining an evolutionary
distance between two proteins, the distances being used as additional information, beyond
mere presence or absence from a genome, in comparing the phylogenetic profiles of different
proteins. The method includes aligning two sequences; determining an evolution
probability process by constructing a conditional probability matrix $p(aa \to aa')$, where
$aa$ and $aa'$ are any amino acids, said conditional probability matrix being constructed by
converting an amino acid substitution matrix from a log odds matrix to said conditional
probability matrix; accounting for an observed alignment of the constructed conditional
probability matrix by taking the product of the conditional probabilities for each aligned pair
during the alignment of the two sequences, represented by

$$P(p) = \prod_{n} p(aa_n \to aa'_n);$$

and determining an evolutionary distance $\alpha$ from the powers equation
$p' = p^{\alpha}(aa \to aa')$, maximizing for $P$. In a further embodiment, the conditional
probability matrix is defined by a Markov process with substitution rates, over a fixed time
interval.
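The distance estimate described above can be sketched in Python under a strong simplifying assumption. Instead of a conditional probability matrix derived from a log odds substitution matrix, the sketch uses a hypothetical uniform matrix in which every amino acid stays unchanged with probability `S` and changes to each of the other `K - 1` amino acids with equal probability. For that toy matrix the power of the matrix has a closed form, so no linear algebra library is needed. The names `p_alpha`, `log_likelihood`, and `evolutionary_distance`, the value of `S`, and the search bounds are all illustrative assumptions, and the alignment is assumed to be ungapped.

```python
import math

K = 20    # alphabet size (amino acids)
S = 0.99  # assumed probability that a residue is unchanged in one unit step


def p_alpha(a, b, alpha, k=K, s=S):
    """Entry (a -> b) of the toy matrix p raised to the power alpha."""
    lam = s - (1.0 - s) / (k - 1)  # repeated non-unit eigenvalue of p
    decay = lam ** alpha
    if a == b:
        return 1.0 / k + (1.0 - 1.0 / k) * decay
    return (1.0 - decay) / k


def log_likelihood(seq1, seq2, alpha):
    """Log of the product over aligned pairs of p^alpha(aa_n -> aa'_n)."""
    return sum(math.log(p_alpha(a, b, alpha)) for a, b in zip(seq1, seq2))


def evolutionary_distance(seq1, seq2, lo=1e-6, hi=2000.0, iters=200):
    """Golden-section search for the alpha that maximizes the likelihood."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    g = (math.sqrt(5) - 1) / 2
    c = hi - g * (hi - lo)
    d = lo + g * (hi - lo)
    for _ in range(iters):
        if log_likelihood(seq1, seq2, c) > log_likelihood(seq1, seq2, d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
    return (lo + hi) / 2


# Hypothetical ungapped alignment with one mismatch in ten positions.
print(evolutionary_distance("ACDEFGHIKL", "ACDEFGHIKV"))
```

Golden-section search is used because, for this toy model, the log-likelihood has a single maximum in alpha. With a realistic substitution matrix the matrix power would have to be computed numerically, for example through an eigendecomposition.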

In yet a further embodiment, the invention provides a method for determining functional
links between at least two polypeptides, comprising aligning a primary amino acid sequence
of multiple distinct non-homologous polypeptides to the primary amino acid sequences of a
plurality of proteins; for any alignment found between the primary amino acid sequences of
all of such multiple distinct non-homologous polypeptides and the primary amino acid
sequence of at least one such protein, outputting an indication identifying the at least one
such protein as an indication of a functional link between the multiple polypeptides;
obtaining data, comprising a list of polypeptides from at least two genomes; comparing the
list of polypeptides from at least two genomes to form a protein phylogenetic profile for each
protein or protein family, wherein the protein phylogenetic profile indicates the presence or
absence of a polypeptide belonging to a particular protein family in each of the at least two
genomes based on homology of the polypeptides; grouping the list of polypeptides based on
similar profiles, wherein a similar profile is indicative of a functional link between the
polypeptides; and comparing the functional links identified above to determine common
links.

In yet another embodiment, the invention further provides for displaying the functional links 
as networks of related proteins comprising placing all polypeptides in a diagram such that 
functionally linked proteins are closer together than all other proteins and identifying proteins 
that fall in a cluster in the diagram as a functionally related group. 
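As a rough illustration of reading groups out of a link network, the Python sketch below treats each functional link as an edge and reports the connected components of the resulting graph. This is a deliberate simplification and an assumption of this example: the text describes placing proteins in a diagram so that linked proteins lie close together, whereas connected components ignore distances entirely. The function name `clusters` and the protein labels are hypothetical.

```python
from collections import defaultdict


def clusters(links):
    """Group proteins into connected components of the functional-link graph."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen, result = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Hypothetical links, e.g. pairs reported by either prediction method.
print(clusters([("A", "B"), ("B", "C"), ("D", "E")]))
```

A layout-based display would additionally need a distance measure between proteins; that part is not sketched here.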

The details of one or more embodiments of the invention are set forth in the accompanying
drawings and the description below. Other features, objects, and advantages of the invention
will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS 

FIG. 1A shows five examples of pairs of E. coli proteins predicted to be functionally-linked
by the Rosetta Stone method. In each example, the top protein is the "Rosetta Stone protein"
and the bottom two proteins are functionally linked.

FIG. 1B shows that the Rosetta Stone analysis finds cases where a protein (C) is similar over
different regions to two distinct, non-homologous proteins (A and B). In such situations, a
functional relationship is inferred between A and B. Genomes i, j, and k can represent a
single genome, or two or three different genomes.

FIG. 2A is a flow diagram describing a Rosetta Stone method of the invention beginning
with the primary sequence of at least two polypeptides having unknown function.

FIG. 2B is a flow diagram describing a method of the invention beginning with the primary
sequence of a Rosetta Stone protein having unknown function.

FIG. 3 is a schematic of phylogenetic pathways. P1 through P7 are distinct non-homologous
proteins.

FIG. 4A shows a flow diagram describing a phylogenetic profile method of the invention
using a bit type profiling method.

FIG. 4B shows a flow diagram describing a phylogenetic profile method of the invention 
using an evolutionary distance method. 

FIG. 5 shows suggestive information on pathways and complexes from linked pairs of
proteins. 5A and 5C represent the shikimate biosynthesis pathway and purine synthesis
pathway, respectively. 5B and 5D describe the links suggested by the Rosetta Stone method.

FIG. 6 shows a model for the evolution of protein-protein interactions. The Rosetta Stone
model starts with the fusion of the genes that code for the non-interacting domains A and B,
leading to expression of the fused two-domain protein AB.

FIG. 7 depicts the occurrence of promiscuous protein domains, those that are found in many
different proteins and are therefore linked to many different domains.

FIG. 8 is a diagram showing the process and result of the method of phylogenetic profiles. In
each case all proteins with identical profiles to the query proteins were found (within the
double box) and then all those with profiles that differed by one bit (in the second column).
Proteins in bold face participate in the same complex or pathway as the query protein and in
italics participate in a different but related complex or pathway. Proteins with identical
profiles are shown within a box. Single lines between boxes represent a one-bit difference
between the two profiles. All neighboring proteins whose profiles differ by one bit from the
query protein are shown. Homologous proteins are connected by a dashed line or indented.
Each protein is labeled by a four-digit E. coli number, a Swissprot gene name and a brief
description. Notice that proteins within a box or in boxes connected by a line have similar
functions. Hypothetical proteins (i.e., of unknown function) are prime candidates for
functional and structural studies. Proteins in the double boxes in (a), (b) and (c) have
respectively 11, 6, and 10 ones in their phylogenetic profiles, out of a possible 16 for the 17
genomes available at the time of calculation.

FIG. 9 shows strategies used to link functionally-related yeast proteins as described in the
Examples.

FIG. 10 shows the high confidence functional links found by phylogenetic profiles for the
yeast protein YGR021W, a member of a protein family conserved in many organisms but of
entirely unknown function.

FIG. 11A shows high and highest confidence functional links established for the yeast prion
Sup35. (B) An illustration of the network of high (thin lines) and highest (bold lines)
confidence links discovered among the proteins (open circles) linked to Sup35 (dark circle).
The network of links shows a high degree of local clustering.

FIG. 12 shows high and highest confidence functional links found for the yeast DNA repair
protein MSH6, which is similar in sequence to colorectal cancer-causing proteins in humans.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION OF THE INVENTION 

As used herein and in the appended claims, the singular forms "a," "an," and "the" include
plural referents unless the context clearly dictates otherwise. Thus, for example, reference to
"a protein" includes a plurality of proteins and reference to "the polypeptide" generally
includes reference to one or more polypeptides and equivalents thereof known to those
skilled in the art, and so forth.

Unless defined otherwise, all technical and scientific terms used herein have the same
meaning as commonly understood by one of ordinary skill in the art to which this invention
belongs. Although any methods, devices and materials similar or equivalent to those
described herein can be used in the practice or testing of the invention, the preferred
methods, devices and materials are now described.

All publications mentioned herein are incorporated herein by reference in full for the purpose
of describing and disclosing the databases, proteins, and methodologies, which are described
in the publications which might be used in connection with the presently described invention.
The publications discussed above and throughout the text are provided solely for their
disclosure prior to the filing date of the present application. Nothing herein is to be
construed as an admission that the inventors are not entitled to antedate such disclosure by
virtue of prior invention.

Definitions 

The following terms have the following meanings when used herein and in the appended
claims. Terms not specifically defined herein have their art-recognized meaning.

An "amino acid" is a molecule having the structure wherein a central carbon atom (the
α-carbon atom) is linked to a hydrogen atom, a carboxylic acid group (the carbon atom of
which is referred to herein as a "carboxyl carbon atom"), an amino group (the nitrogen atom
of which is referred to herein as an "amino nitrogen atom"), and a side chain group, R.

When incorporated into a peptide, polypeptide, or protein, an amino acid loses one or more
atoms of its amino and carboxylic groups in the dehydration reaction that links one amino
acid to another. As a result, when incorporated into a protein, an amino acid is referred to as
an "amino acid residue."

"Protein" refers to any polymer of two or more individual amino acids (whether or not
naturally occurring) linked via a peptide bond, which occurs when the carboxyl carbon atom of
the carboxylic acid group bonded to the α-carbon of one amino acid (or amino acid residue)
becomes covalently bound to the amino nitrogen atom of the amino group bonded to the
α-carbon of an adjacent amino acid. The term "protein" is understood to include the terms
"polypeptide" and "peptide" (which, at times, may be used interchangeably herein) within its
meaning. In addition, proteins comprising multiple polypeptide subunits (e.g., DNA
polymerase III, RNA polymerase II) or other components (for example, an RNA molecule, as
occurs in telomerase) will also be understood to be included within the meaning of "protein"
as used herein. Similarly, fragments of proteins and polypeptides are also within the scope of
the invention and may be referred to herein as "proteins."

A particular amino acid sequence of a given protein (i.e., the polypeptide's "primary
structure," when written from the amino-terminus to carboxy-terminus) is determined by the
nucleotide sequence of the coding portion of an mRNA, which is in turn specified by genetic
information, typically genomic DNA (including organelle DNA, e.g., mitochondrial or
chloroplast DNA).

By a "functional link" or "functionally-linked polypeptides" is meant polypeptides that are
predicted to be linked, for example, by acting in a common biochemical or metabolic pathway,
by forming part of a related protein complex, by physically interacting, or by acting upon
one another.

ROSETTA STONE METHOD

This method compares protein sequences across all known genomes and finds cases where
proteins that are separate in one organism (or separately contained in two different
organisms) are joined into one larger protein in another organism. In such cases, the two
separate proteins often carry out related or sequential functions or form part of a larger
protein complex. Therefore, the general function of one component (e.g., one or more of the
unknown proteins) can be inferred from the function of the other component if it is known.
In addition, merely identifying links between proteins using the method described herein
provides valuable information regardless of whether the function of one or more of the
proteins used to form the link(s) is known. The two components do not have similar amino
acid sequences, so the function of one would not be inferred from the other on the basis of
sequence similarity alone.

The method described herein (i.e., the "Rosetta Stone Method") is based on the idea that
proteins that participate in a common structural complex, metabolic pathway, or biological
process, or that have closely related physiological functions, are functionally linked. In
addition, the method is also capable of identifying proteins that interact physically with one
another. Functionally linked proteins in one organism can often be found fused into a single
polypeptide chain in a different organism. Similarly, fused proteins in one organism can be
found as individual proteins in other organisms. For example, in a first organism, or in two
separate organisms, one might identify two un-linked proteins "A" and "B" with unknown
function. In another organism, one may find a single protein "AB" with a part that resembles
"A" and a part that resembles "B". Protein AB allows one to predict that "A" and "B" are
functionally-related.
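The A, B, AB pattern above can be sketched in Python as a search over precomputed alignment hits. This is a minimal illustration under stated assumptions, not the inventors' implementation: the `Hit` record, the `rosetta_links` function, and the coordinates are hypothetical, and the hits are assumed to come from some sequence search run beforehand. The sketch also does not check that the two linked proteins are dissimilar to each other, which the method requires as a separate step.

```python
from collections import namedtuple

# One alignment of a query protein against a region of a candidate protein.
# Hand-made here; in practice these would come from a sequence search tool.
Hit = namedtuple("Hit", "query candidate start end")


def overlaps(h1, h2):
    """True if two hits cover overlapping regions of the same candidate."""
    return h1.start <= h2.end and h2.start <= h1.end


def rosetta_links(hits):
    """Return (A, B, candidate) triples where A and B match different,
    non-overlapping regions of one candidate Rosetta Stone protein."""
    by_candidate = {}
    for hit in hits:
        by_candidate.setdefault(hit.candidate, []).append(hit)

    links = set()
    for candidate, cand_hits in by_candidate.items():
        for i, h1 in enumerate(cand_hits):
            for h2 in cand_hits[i + 1:]:
                if h1.query != h2.query and not overlaps(h1, h2):
                    a, b = sorted((h1.query, h2.query))
                    links.add((a, b, candidate))
    return sorted(links)


hits = [
    Hit("A", "AB", 1, 200),    # A resembles the first part of AB
    Hit("B", "AB", 220, 450),  # B resembles the second part of AB
    Hit("C", "AB", 150, 260),  # C overlaps both regions, so it links to neither
    Hit("A", "X", 1, 200),     # X matches only one query, so no link
]
print(rosetta_links(hits))
```

Requiring non-overlapping regions is what distinguishes a fused two-part protein from a protein that is simply homologous to both queries over the same segment.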



The particular functional activity of each distinct protein in the Rosetta Stone method need
not be known prior to performing the method (i.e., the function of A, B, or AB need not be
known). Performing the Rosetta Stone method with unknown proteins can provide
information regarding relationships of each protein absent knowledge of the functional
activity of the proteins themselves. For example, the information (i.e., the links) can provide
information that the proteins are part of a common pathway, function in a related process or
physically interact. Such information need not be based on the biological functions of the
individual proteins. The method of the invention can provide information regarding
functional links between proteins not previously known to function together, for example, in a
concerted process. A marker, for example, for a particular disease state is identified by the
presence or absence of a protein (e.g., Her2/neu in breast cancer detection). Links (i.e.,
information) identified by the methods of the invention, which link proteins "B" and "C" to
such a marker suggest that proteins "B" and "C" are related by function, physical interaction
or are part of a common biological pathway with the marker. Such information is useful in
making diagnostics, identifying drug targets and therapeutics. Accordingly, the Rosetta
Stone method of the invention is performed by sequence comparison that searches for
incomplete "triangle relationships" between, for example, three proteins, i.e., for two proteins
A' and B' that are different from one another but similar in sequence to another protein AB.
Completing the triangle relationship provides useful information regarding the proteins'
biological function, functional interaction, pathway relationships or physical relationships
with other proteins in the "triangle".

As an example, FIG. 1 shows five examples of pairs of E. coli proteins predicted to interact
by the domain fusion analysis (i.e., the Rosetta Stone method). Each protein is shown
schematically with boxes representing domains (as defined in the ProDom domain database).
For each example, a triplet of proteins is pictured. The second and third proteins are
predicted to interact because their homologs are fused in the first protein (called the Rosetta
Stone protein). The protein pairs in the first three predictions are known from experiments to
interact (Sugino et al., Nucleic Acids Res. 8, 3865 (1980); Yeh and Ornston, J. Biol. Chem.,
256, 1565 (1981); McHenry and Crow, J. Biol. Chem., 254, 1748 (1979)). The final two examples
show pairs of proteins from the same pathway (two nonsequential enzymes from the histidine
biosynthesis pathway and the first two steps of the proline biosynthesis pathway) that are not
known to interact directly. The inventors have recognized that when this pattern of three
proteins exists - two separate proteins from a first organism (or from two distinct organisms)
that are homologous to different portions of a single protein from another organism - the two
separate proteins are usually "functionally-related" based on the data showing they have a
higher than random chance of being physically or functionally linked. Accordingly, the
invention overcomes the shortfalls of previous methods by providing a relationship between
the linked proteins found by the Rosetta Stone Method even though they do not have amino acid
sequence similarity with each other and therefore cannot be linked by conventional sequence
alignment techniques.

The methods of the invention are applicable to both nucleotide sequences and amino acid
sequences. Typically amino acid sequences will be used to perform the methods of the
invention. However, where a nucleic acid sequence is to be used, it is typically first
translated to an amino acid sequence. Such translation may be performed in all
frames of the nucleic acid sequence if the coding sequence is not known. Programs that can
translate a nucleic acid sequence are known in the art. In addition, although for simplicity the
description of the invention discusses the use of a "pair" of proteins in the determination of a
Rosetta Stone protein, more than two (e.g., 3, 4, 5, 10, 100 or more proteins) may be used.
Accordingly, one can analyze chains of linked proteins, such as "A" linked by a Rosetta
Stone protein to "B" linked by a Rosetta Stone protein to "C", etc. By this method, groups of
functionally related proteins can be found and their function identified.
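Translation in all frames, as mentioned above, can be illustrated with a short Python sketch. It assumes the standard genetic code and a DNA input containing only A, C, G and T; the function names `translate` and `six_frame_translation` and the frame labels are choices made for this example only. Codons containing any other character are rendered as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three forward frames and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translated strings could then serve as a probe sequence in place of a known protein sequence.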

In one embodiment the method of the invention starts with identifying the primary amino
acid sequence for a plurality of proteins whose functional relationship is to be determined
(e.g., protein A' and protein B'). A number of source databases are available, as described
above, that contain either a nucleic acid sequence and/or a deduced amino acid sequence for
use with the first step of the invention. All sequences to be tested (the "probe sequences")
are used to search a sequence database (e.g., GenBank, PFAM or ProDom), either
simultaneously or individually. Every protein in the sequence database is examined for its
ability to act as a Rosetta Stone protein (i.e., a single protein containing polypeptide
sequences or domains from both protein A' and protein B'). A number of different methods
of performing such sequence searches are known in the art. Such sequence alignment
methods include, for example, BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock &
Collins, 1993), and FASTA (Pearson & Lipman, 1988). The probe sequence can be any
length (e.g., about 50 amino acid residues to more than 1000 amino acid residues).

Probe sequences (e.g., polypeptide sequences or domains) found in a single protein (e.g., AB
protein) are defined as being "linked" by that protein. Where pairs of probe sequences are used
individually to search the sequence database, one can mask those segments having homology
to the first probe sequence found in the proteins of the sequence database prior to searching
with the subsequent probe sequence. In this way, one eliminates any potential overlapping
sequences between the two or more probe sequences.
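
The non-overlap requirement can be sketched as follows. This is only an illustration under stated assumptions: the search hits are assumed to be available as tuples of (probe id, database protein id, start, end), with start and end giving the matched region on the database protein. That tuple format and the function names are hypothetical and do not correspond to the output of any particular search program.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two inclusive (start, end) intervals share any residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def rosetta_stone_candidates(hits, probe_a, probe_b):
    """Return database proteins matched by both probes in disjoint regions.

    `hits` is an iterable of (probe_id, db_protein_id, start, end) tuples
    (a hypothetical format assumed for this sketch).
    """
    regions = defaultdict(lambda: defaultdict(list))
    for probe, target, start, end in hits:
        regions[target][probe].append((start, end))
    candidates = []
    for target, by_probe in regions.items():
        # Keep the target if some region matched by A is disjoint from
        # some region matched by B.
        if any(not overlaps(ra, rb)
               for ra in by_probe.get(probe_a, [])
               for rb in by_probe.get(probe_b, [])):
            candidates.append(target)
    return sorted(candidates)
```

A database protein that both probes match only in the same region is rejected, which has the same effect as masking the first probe's matches before searching with the second.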

The linked proteins can then be further compared for similarity with one another by amino
acid sequence comparison. Where the sequences have high homology, such a finding can be
indicative of the formation of homo-dimers, -trimers, etc. Typically, Rosetta Stone linked
proteins are only kept when the linked proteins show no homology to one another (e.g.,
hetero-dimers, -trimers, etc.).

In another embodiment of the method of the invention, a potential fusion protein lacking any
functional information and that is suspected of having two or more domains (e.g., a potential
Rosetta Stone protein) may be used to search for related proteins by a similar method. In this
embodiment, the primary amino acid sequence of the fusion protein is determined and used as
a probe sequence. This probe sequence is used to search a sequence database (e.g., GenBank,
PFAM or ProDom). Every protein in the sequence database is examined for homology to the
potential fusion protein (i.e., multiple proteins containing polypeptide sequences or domains
from the potential fusion protein). A number of different methods of performing such
sequence searches are known in the art (e.g., BLAST, BLITZ (MPsrch), and FASTA).



Probe sequences found in more than one protein (e.g., A' and B' proteins) are defined as
being "linked" so long as at least one protein per domain containing that domain but not the
other is also identified. In other words, at least one protein or domain of the plurality of
proteins must also be found alone in the sequence database. This verifies that the protein or
domain is not an integral part of a first protein but rather a second independent protein having
its own functional characteristics.
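
The "found alone" criterion can be illustrated with a small sketch. It assumes that each database protein has already been annotated with the set of domains it contains; the dictionary layout and the function name `linked_domain_pairs` are hypothetical. The sketch applies the criterion as stated above, requiring each domain of a pair to occur in at least one protein that lacks the other.

```python
from itertools import combinations


def linked_domain_pairs(protein_domains):
    """Map each linked domain pair to the proteins that contain both domains.

    `protein_domains` maps a protein id to the set of domain ids it contains
    (an assumed input format). A pair counts as linked only if each domain
    also occurs in some protein that lacks the other domain.
    """
    links = {}
    for protein, domains in protein_domains.items():
        for a, b in combinations(sorted(domains), 2):
            links.setdefault((a, b), []).append(protein)

    def occurs_without(x, y):
        return any(x in d and y not in d for d in protein_domains.values())

    return {pair: proteins for pair, proteins in links.items()
            if occurs_without(pair[0], pair[1])
            and occurs_without(pair[1], pair[0])}
```

In this sketch a pair of domains that only ever appears together is never reported as linked, since neither domain is shown to exist as an independent protein.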

Statistical methods can be used to judge the significance of possible matches. The statistical
significance of an alignment score is described by the probability, P, of obtaining a higher
score when the sequences are shuffled. One way to compute a P value threshold is to first
consider the total number of sequence comparisons that are to be performed. If there are N
proteins in E. coli and M in all other genomes, this number is N x M. A comparison of this
number of random sequences would be expected to yield one pair with a P value of 1/NM by
chance; this value is then set as the threshold. The threshold may be set lower or higher
according to the accuracy desired.
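
The threshold calculation amounts to a single division, sketched below. The function names and the (query, target, p value) hit format are assumptions made for this example only.

```python
def p_value_threshold(n_query_proteins, n_other_proteins):
    """P value at which about one chance match is expected in N x M comparisons."""
    comparisons = n_query_proteins * n_other_proteins
    return 1.0 / comparisons


def significant_hits(hits, threshold):
    """Keep (query, target, p_value) hits whose p value is below the threshold."""
    return [hit for hit in hits if hit[2] < threshold]
```

For instance, with N = 10 and M = 100 there are 1000 comparisons, so the threshold under this scheme is 0.001.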

The method of the invention provides information regarding which proteins are functionally
related (e.g., related biological functions, common structural complexes, metabolic pathways,
signaling pathways, or other biological processes), a subset of which physically interact
in an organism.

FIG. 2 is an operational flow diagram generally illustrating two embodiments of the
invention. FIGS. 2A and 2B depict the use of Rosetta Stone proteins to predict the functional
link or relationship of proteins. Referring now to FIG. 2A, in step 102 the primary amino
acid sequence of at least two distinct non-homologous polypeptides is input into a computer.
The biological function of the two polypeptides may be known or may be unknown. The
primary sequence of the polypeptides may be input manually (i.e., by typing the sequence
into a computer) or may be derived from a database of protein or nucleic acid sequences
available through various databases as described above. "Substantially homologous" means
that the p value of the alignment score is statistically significant. A number of publicly
available alignment programs can be used to determine the homology including, for example,
BLAST and FASTA. A comparison of the polypeptide sequences can be performed to ensure
that the polypeptides are non-homologous. As a result only proteins having distinct
non-homologous polypeptide domains will be used for further analysis.
In step 106, the input polypeptide sequences having distinct non-homologous polypeptide
domains are aligned with the sequences contained in a protein sequence database. The
proteins may have known or unknown biological functions. Examples of databases with
protein sequences include, for example, GenBank, PFAM, SwissProt or ProDom. Every
protein in the sequence database is examined for homology to the first and second proteins.
A number of different methods of performing such sequence searches are known in the art
(e.g., BLAST, BLITZ (MPsrch), and FASTA). Typically, the matches are determined by p
value thresholds, as identified above and depicted at step 108. If there are no matches found,
this determination is indicated at step 110. The input polypeptide sequences may be aligned
simultaneously with the proteins of the database or they may be aligned sequentially. In a
sequential alignment, those proteins having a match to a previously aligned polypeptide can
be masked. Matches of proteins from the database containing sequences from all the
polypeptides input at step 102 (e.g., containing sequences from both protein A and
protein B, i.e., the Rosetta Stone protein(s)) are identified, a list is compiled and the function
of any matched proteins is indicated at step 114. Where the function of a matched protein is
known, this function is used to determine possible functions of the unknown polypeptide
sequences. Alternatively, following alignment and compilation of matched proteins, the
matched proteins may be further filtered at step 112, as described below (see Filtering
Methods). The inventors have discovered that proteins that can be associated together via the
Rosetta Stone protein tend strongly to be functionally linked.

Referring now to FIG. 2B, an alternative method for determining functional links of a protein
is provided. In this embodiment, one starts with a potential Rosetta Stone protein and works
in reverse. In step 120, the primary amino acid sequence of a Rosetta Stone protein is input
into the computer. The primary sequence of the protein may be input manually (i.e., by
typing the sequence into a computer) or may be derived from a database of protein or
nucleic acid sequences available to the public through various databases as described above.

In step 122, the protein sequence is aligned with a database of protein sequences. Every
protein in the sequence database is examined for homology to domains of the Rosetta Stone
protein. A number of different methods of performing such sequence searches are known in
the art (e.g., BLAST, BLITZ (MPsrch), and FASTA). Typically, matches are determined by
p value thresholds, as identified above and depicted at step 124. If there are no matches
found, this determination is indicated at step 126. A list of distinct matched proteins is
compiled and indicated at step 130. In order to ensure that the distinct non-homologous
polypeptides align to the Rosetta Stone protein in a non-overlapping fashion, the distinct
polypeptides can be compared to determine homology. This ensures identification of at least
one protein per domain containing that domain, but not the other domain. In other words, at
least one protein or domain of the unknown proteins in the database must also be found alone
in the sequence database. This verifies that the first matched protein is not homologous to
the second matched protein.


Alignment Algorithms 

To align sequences, a number of different procedures can be used that produce a good match
between the corresponding residues in the sequences. Typically, Smith-Waterman or
Needleman-Wunsch algorithms are used. However, as discussed above, faster procedures
such as BLAST, FASTA and PSI-BLAST can be used.
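
For readers unfamiliar with local alignment, a minimal sketch of Smith-Waterman scoring follows. The scoring values (match +2, mismatch -1, linear gap penalty -1) are arbitrary assumptions for the illustration; a practical protein alignment would instead use a substitution matrix such as BLOSUM62 together with affine gap penalties.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score under a linear gap penalty."""
    previous = [0] * (len(seq_b) + 1)   # previous row of the score matrix
    best = 0
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

Only two rows of the dynamic programming matrix are kept, which is sufficient when the score alone is needed and the alignment itself is not reconstructed.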

Filtering Methods 

The Rosetta Stone Method described herein provides at least two pieces of information. First,
the method provides information regarding which proteins are functionally related. Second,
the method provides information regarding which proteins are physically related. Each of
these two pieces of information has different sources of prediction error. The first type
of error is introduced by protein sequences that occur in many different proteins and are paired
with many other protein sequences. The second type of error is introduced by there often
being multiple copies of similar proteins, called paralogs, in a single organism. In general,
the Rosetta Stone Method predicts functionally related proteins well, with no filtering of
results required. However, it is possible to filter the error associated with either the first or
second type of information.

The inventors recognized that a few domains are linked to an excessive number of other
domains by a Rosetta Stone protein. The inventors recognized, for example, that 95% of the
domains were linked to fewer than 13 other domains. However, some domains (e.g., the Src
Homology 3 (SH3) domain or ATP-binding cassette (ABC) domains) link to more than a
hundred other domains. These links were filtered by removing all links involving
these 5% of domains (i.e., the domains linked to more than 13 other domains). For example,
in E. coli, without filtering, 3531 links were identified using the domain-based analysis, but
after filtering only 749 links were identified. This method improved prediction of
functionally related proteins by 28% and physically related proteins by 47%. Accordingly,
there are a number of ways to filter the results to improve the significance of the functional
links. As described above, as the number of functional links increases there is a
higher chance of finding a Rosetta Stone protein. By removing the excessively linked
proteins one reduces the number of chance Rosetta Stone proteins and thus increases the
significance of a functional link.
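
A sketch of this filter is given below. It assumes that links are available as pairs of domain identifiers and uses the cutoff of 13 partners quoted above as a default; the function name and input format are hypothetical.

```python
from collections import Counter


def filter_promiscuous_links(links, max_partners=13):
    """Drop every link that involves a domain with more than `max_partners` partners.

    `links` is an iterable of (domain_a, domain_b) pairs (assumed format).
    """
    unique = {tuple(sorted(pair)) for pair in links}   # ignore pair order
    partners = Counter()
    for a, b in unique:
        partners[a] += 1
        partners[b] += 1
    return sorted(pair for pair in unique
                  if partners[pair[0]] <= max_partners
                  and partners[pair[1]] <= max_partners)
```

The cutoff is a parameter here because the appropriate value presumably depends on the database searched; 13 is simply the value reported in the text.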

In addition it was recognized that error introduced by multiple paralogs of linked proteins
should have little effect on functional prediction, as paralogs usually have very similar
function, but will affect the reliability of prediction of protein-protein interactions. This
error is calculated for each linked protein pair, and can be estimated roughly as:

    $\text{Fractional Error} = 1 - \frac{1}{\sqrt{N}}$

where N is the number of paralogous protein pairs (e.g., A linked to B, A' linked to B', A
linked to B', and A' linked to B, in the case that A and A' are paralogs, as are B and B', and
the linking protein is AB as above).

The error can also be estimated as 1-T, where T is the mean percent of potential true positives
calculated for all domain pairs in an organism. For each domain pair linked by a Rosetta
Stone protein, there are n proteins with the first domain but not the second, and m proteins
with the second domain but not the first. The percent of true positives T is therefore
estimated as the smaller of n or m divided by n times m. As this error 1-T can be calculated
for each set of linked domains, it can describe the confidence in any particular predicted
interaction.
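
The 1-T estimate for a single set of linked domains can be written out as a short sketch. The function names are hypothetical, and T is returned here as a fraction between 0 and 1 rather than as a percentage.

```python
def true_positive_fraction(n, m):
    """T = min(n, m) / (n * m), for n and m proteins carrying only one domain."""
    if n < 1 or m < 1:
        raise ValueError("n and m must both be at least 1")
    return min(n, m) / (n * m)


def interaction_error(n, m):
    """Estimated error 1 - T for one pair of linked domains."""
    return 1.0 - true_positive_fraction(n, m)
```

With a single protein on each side (n = m = 1) the estimated error is 0. With two paralogs on each side there are four candidate pairs and at most two true ones, so the estimate is 0.5.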

In addition, error in functional links can be caused by small conserved regions or
common amino acid sequences being repeatedly identified in a Rosetta Stone protein by a
plurality of distinct non-homologous polypeptides. To reduce this error, the alignment
percentage (the fraction of an entire sequence that can be aligned to another) between the
Rosetta Stone and the distinct non-homologous polypeptide can be measured. Alignment
percentages of about 50 to 90%, more typically about 75%, between the Rosetta Stone and
the distinct polypeptide are indicative of links that are not attributable to such a small peptide
sequence.
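
One way to compute an alignment percentage is sketched below. It assumes the aligned regions are given as inclusive, 1-based (start, end) coordinates on the polypeptide and that the percentage is taken relative to the full length of that polypeptide; both are assumptions made for this example.

```python
def alignment_percentage(aligned_regions, sequence_length):
    """Percent of a sequence covered by inclusive, 1-based aligned regions."""
    covered = 0
    last_end = 0
    for start, end in sorted(aligned_regions):
        start = max(start, last_end + 1)   # skip residues already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / sequence_length
```

Overlapping regions are counted once, so two local alignments covering residues 1-50 and 40-80 of a 100-residue polypeptide give 80%.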

PHYLOGENETIC PROFILE METHOD 

The phylogenetic profile method compares protein sequences across all or many known
genomes and analyzes the pattern of inheritance of each protein across the different
organisms. In its simplest form, each protein is simply characterized by its presence or
absence in each organism. For example, if there are 16 known genomes, then each protein
may be assigned a 16-bit code or phylogenetic profile. Since proteins that function together
(e.g., in the same metabolic pathway or as part of a larger structural complex) evolve in a
correlated fashion, they should have the same or similar patterns of inheritance, and therefore
similar phylogenetic profiles. Therefore, the function of one protein may be inferred from
the function of another protein that has a similar profile, if its function is known. As with
the Rosetta Stone method (above), the function of one protein is inferred from the function of
another protein which is dissimilar in sequence. Furthermore, even if neither of the two
proteins has an assigned function, the predicted link between the proteins has utility in
developing, for example, diagnostics and therapeutics. The phylogenetic profile method can
be implemented in a binary code (i.e., describing the presence or absence of a given protein
in an organism) or a continuous code that describes how similar the related sequences are in
the different genomes. In addition, grouping of similar protein profiles may be made wherein
similar profiles are indicative of functionally related proteins. Furthermore, the requirements
for similarity can be modified depending upon particular criteria by varying the number of
bits required to match. For example, criteria requiring that the degree of similarity in the
profile include all 16 bits being identical can be set, but may be modified so that similarity in
15 bits of the 16 bits would indicate relatedness of the protein profiles as well. Statistical
methods can be used to determine how similar two patterns must be in order to be related.
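
The bit-matching criterion can be sketched as a Hamming distance between two profiles. Profiles are represented here as strings of "0" and "1" characters, which is one convenient choice among several; the function names are hypothetical.

```python
def hamming_distance(profile_a, profile_b):
    """Number of genomes in which two binary profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def profiles_related(profile_a, profile_b, max_differing_bits=0):
    """True if the profiles differ in at most `max_differing_bits` positions."""
    return hamming_distance(profile_a, profile_b) <= max_differing_bits
```

Setting `max_differing_bits=0` requires all 16 bits to match, while `max_differing_bits=1` accepts profiles that agree in 15 of 16 bits.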

The phylogenetic profile method discussed is applicable to any genome including viral,
bacterial, archaeal or eukaryotic. The method of phylogenetic profile grouping provides the
prediction of function for a previously uncharacterized protein(s). The method also allows
prediction of new functional roles for characterized proteins. It also provides potentially
informative connections (i.e., links) between uncharacterized proteins.

The method of protein phylogenetic profiles is illustrated schematically in FIG. 3 for the
hypothetical case of four fully sequenced genomes, in which the functional relationship of
seven proteins (P1 through P7) is described. For each hypothetical E. coli protein a profile
was constructed, indicating which genomes code for homologs of the protein. A cluster or
group of the profiles was created to determine which proteins share the same profiles.
Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be
functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one
bit and are termed neighbors.

In one embodiment a computational method detects proteins that participate in a common
structural complex or metabolic pathway. Proteins within these groups are defined as
"functionally-linked" in that functionally-linked proteins evolve in a correlated fashion, and
therefore have homologs in the same subset of organisms. For example, flagellar proteins are
found in bacteria that possess flagella but not in other organisms. Accordingly, if two
proteins have homologs in the same subset of fully sequenced organisms they are likely to be
functionally linked. The methods of the invention use this concept to systematically map
links between all the proteins coded by a genome. Typically, functionally linked proteins
have no amino acid sequence similarity with each other and therefore cannot be linked by
conventional sequence alignment techniques.

To represent the subset of organisms that contain a homolog, a phylogenetic profile is
constructed for each protein. The simplest manner to represent a protein's phylogenetic
history is via a binary phylogenetic profile for each protein. This profile is a string with N
entries, each one bit, where N corresponds to the number of genomes. The number of
genomes can be any number of two or more (e.g., 2, 3, 4, 5, 10, 100, to 1000 or more). The
presence of a homolog to a given protein in the nth genome is indicated with an entry of unity
at the nth position (e.g., in a binary system an entry of 1). If no homolog is found the entry is
zero. Proteins are clustered according to the similarity of their phylogenetic profiles. Similar
profiles show a correlated pattern of inheritance, and by implication, functional linkage. The
method predicts that the functions of uncharacterized proteins are likely to be similar to
characterized proteins within a cluster (FIG. 3).
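
Constructing binary profiles and grouping proteins with identical profiles can be sketched as follows. The input, a mapping from each protein to the set of genomes in which a homolog was found, is assumed to have been produced by a prior homology search; the function names and identifiers are hypothetical.

```python
from collections import defaultdict


def build_profiles(homolog_genomes, genomes):
    """Return protein -> profile string with one '1' or '0' per genome.

    `homolog_genomes` maps a protein id to the set of genomes containing a
    homolog; `genomes` fixes the order of the positions in the profile.
    """
    return {protein: "".join("1" if g in found else "0" for g in genomes)
            for protein, found in homolog_genomes.items()}


def cluster_identical(profiles):
    """Group proteins that share exactly the same profile."""
    clusters = defaultdict(list)
    for protein, profile in profiles.items():
        clusters[profile].append(protein)
    return {profile: sorted(members) for profile, members in clusters.items()}
```

Grouping by identical profile is the strictest form of clustering; looser groupings would compare profiles with a distance measure instead of exact equality.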

In order to decide whether a genome contains a protein related to another particular protein,
the query amino acid sequence is aligned with each of the proteins from the genome(s) in
question using a known alignment algorithm (see above). The statistical
significance of any alignment score is described by the probability, p, of obtaining a higher
score when the sequences are shuffled. One way to compute a p value threshold is to first
consider the total number of sequence comparisons that are being made. If there are N
proteins in a first organism's genome and M in all other genomes, this number is N x M. If
this number of comparisons were made between random sequences, it would be expected that
one pair would yield a p value of 1/NM. This value can be set as a threshold. Other thresholds
may be used and will be recognized by those of skill in the art.

In another embodiment, a non-binary phylogenetic profile can be used. In this embodiment,
the phylogenetic profile is a string of N entries where the nth entry represents the evolutionary
distance of the query protein to the homolog in the nth genome. To define an evolutionary
distance between two sequences, an alignment between the two sequences is performed. Such
alignments can be carried out by any number of algorithms known in the art (for examples,
see those described above). The evolution is represented by a Markov process with
substitution rates, over a fixed interval of time, given by a conditional probability matrix:

    $p(aa \to aa')$

where aa and aa' are any amino acids. One way to construct such a matrix is to convert the
BLOSUM62 amino acid substitution matrix (or any other amino acid substitution matrix,
e.g., PAM100, PAM250) from a log odds matrix to a conditional probability (or transition)
matrix:

    $P_B(i \to j) = p(j) \, 2^{\mathrm{BLOSUM62}_{ij}/2}$    (1)

$P_B(i \to j)$ is the probability that amino acid i will be replaced by amino acid j through point
mutations according to the BLOSUM62 scores. The p(j)'s are the abundances of amino acid j
and are computed by solving the 20 linear equations given by the normalization conditions
that:

    $\sum_j P_B(i \to j) = 1$ .    (2)

The probability of this process is computed to account for the observed alignment by taking
the product of the conditional probabilities for each aligned pair:

    $P(p) = \prod_n p(aa_n \to aa'_n)$ .    (3)

A family of evolutionary models is then tested by taking powers of the conditional
probability matrix: $p' = p^{\alpha}(aa \to aa')$. The power, $\alpha$, that maximizes P is defined to be the
evolutionary distance.
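
The search for the power that maximizes the alignment probability can be illustrated with a toy sketch. Several simplifications are assumed and none of them comes from the text: a small alphabet with a made-up transition matrix stands in for the 20 x 20 matrix derived from BLOSUM62, only integer powers are tried, and the two sequences are taken to be already aligned without gaps. All function names are hypothetical.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def log_likelihood(matrix, index, aligned_pairs):
    """Sum of log p(aa -> aa') over the aligned residue pairs."""
    return sum(math.log(matrix[index[a]][index[b]]) for a, b in aligned_pairs)


def evolutionary_distance(transition, alphabet, seq_a, seq_b, max_power=50):
    """Integer power of `transition` that maximizes the alignment likelihood."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    pairs = list(zip(seq_a, seq_b))   # gap-free alignment of equal lengths
    current = transition
    best_power, best_score = 1, log_likelihood(current, index, pairs)
    for power in range(2, max_power + 1):
        current = mat_mul(current, transition)   # transition ** power
        score = log_likelihood(current, index, pairs)
        if score > best_score:
            best_power, best_score = power, score
    return best_power
```

Working with the logarithm of the product avoids numerical underflow for long alignments. Identical sequences give the smallest power tried, and more divergent sequences give larger powers.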

Many other schemes may be imagined to deduce the evolutionary distance between two
sequences. For example, one might simply count the number of positions in the sequence
where the two proteins have adopted different amino acids.

Although the phylogenetic history of an organism can be presented as a vector (as described
above), the phylogenetic profiles need not be vectors, but may be represented by matrices.
Such a matrix includes all the pairwise distances between a group of homologous proteins, each
one from a different organism. Similarly, phylogenetic profiles could be represented as
evolutionary trees of homologous proteins. Functional proteins could then be clustered or
grouped by matching similar trees, rather than vectors or matrices.

In order to predict function, different proteins are grouped or clustered according to the
similarity of their phylogenetic profiles. Similar profiles indicate a correlated pattern of
inheritance, and by implication, functional linkage. The phylogenetic profile method predicts
that the functions of uncharacterized proteins are likely to be similar to characterized proteins
within a group or cluster.

Grouping or clustering may be accomplished in many ways. The simplest is to compute the
Euclidean distance between two profiles. Another method is to compute a correlation
coefficient to quantify the similarity between two profiles. All profiles within a specified
distance of the query profile are considered to be a cluster or group.
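
Both similarity measures can be sketched in a few lines. Profiles are assumed to be equal-length lists of numbers (binary or continuous), the correlation is taken to be the Pearson coefficient, and the function names are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numeric profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def correlation(p, q):
    """Pearson correlation coefficient; undefined for a constant profile."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_p * var_q)


def cluster_around(query, profiles, max_distance):
    """Names of all profiles within `max_distance` of the query profile."""
    return sorted(name for name, profile in profiles.items()
                  if euclidean_distance(profiles[query], profile) <= max_distance)
```

For binary profiles the squared Euclidean distance equals the number of differing bits, so a distance cutoff of 1.0 admits profiles that differ from the query in at most one genome.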

Typically a genome database will be used as a source of sequence information. Where the
genome database contains only nucleic acid sequences, the nucleic acid sequence is
translated to an amino acid sequence in frame (if known) or in all frames if unknown. Direct
comparison of the nucleic acid sequences of two or more organisms may be feasible but will
likely be more difficult due to the degeneracy of the genetic code. Programs capable of
translating a nucleic acid sequence are known in the art or easily programmed by those of
skill in the art to recognize a codon sequence for each amino acid.

FIG. 4 depicts a flow diagram describing the basic algorithm used in determining
functionally related proteins by the phylogenetic pathway method. Beginning with step 220
in FIG. 4A, data is obtained representing a list of proteins from at least two organisms. As
described herein the data may be manually input or may be loaded or obtained from a
database(s). The data typically will be in the form of amino acid sequence listings or nucleic
acid sequence listings. At step 222, the list of proteins is compared to create a phylogenetic
profile. The phylogenetic profile provides an indication of those proteins in each of the at
least two organisms that share some degree of homology. Such a comparison can be done by
any number of alignment algorithms known in the art or easily developed by one skilled in
the art (see, for example, those listed above, e.g., BLAST, FASTA etc.). In addition,
thresholds can be set regarding a required degree of homology. Each protein is then grouped
at 224 with related proteins that share a similar phylogenetic profile. Grouping algorithms
include, for example, those described herein. At 226 proteins sharing similar profiles are
indicated and their known functions identified, if any.

With reference to FIG. 4B, a modification of the method of FIG. 4A is depicted. Beginning
with step 320 in FIG. 4B, data is obtained representing a list of proteins from at least two
organisms. As described herein the data may be manually input or may be loaded or
obtained from a database. The data typically will be in the form of amino acid sequence
listings or nucleic acid sequence listings. At step 322, the list of proteins is aligned between
each protein in the input organisms. Such an alignment can be done by any number of
alignment algorithms known in the art or easily developed by one skilled in the art (see, for
example, those listed above, e.g., BLAST, FASTA etc.). At step 324, an evolutionary
distance value is calculated by the methods described above. If the evolutionary distance
threshold is met at step 326, those proteins meeting the evolutionary threshold value are
identified at step 328; otherwise no match is indicated at step 327.

COMBINATION METHODS 

Prediction of functionally linked proteins by the Rosetta Stone method can be filtered by
other methods that predict functionally-linked proteins, such as the protein phylogenetic
profile method or the analysis of correlated mRNA expression patterns. It was found, when
filtering the Rosetta Stone predictions for S. cerevisiae by these two methods, that proteins
predicted to be functionally linked by two or more of these three methods were as likely to be
functionally related as proteins that were observed to physically interact by experimental
techniques such as yeast two-hybrid or co-immunoprecipitation methods.
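
Requiring agreement between methods can be sketched as a simple vote count over predicted pairs. The input format (a mapping from a method name to the protein pairs it predicts) and the function name are assumptions made for this illustration.

```python
from collections import Counter


def links_supported_by(predictions, min_methods=2):
    """Return protein pairs predicted by at least `min_methods` methods.

    `predictions` maps a method name to an iterable of (protein_a, protein_b)
    pairs; the order of the two proteins within a pair is ignored.
    """
    votes = Counter()
    for pairs in predictions.values():
        # Count each pair at most once per method.
        for pair in {frozenset(p) for p in pairs}:
            votes[pair] += 1
    return sorted(tuple(sorted(pair)) for pair, count in votes.items()
                  if count >= min_methods)
```

Raising `min_methods` trades coverage for confidence: fewer links survive, but each is supported by more independent lines of evidence.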

Combinations of these methods of prediction can be used to establish functional links
between proteins with very high confidence. The methods of the invention (i.e., the Rosetta
Stone method and the Phylogenetic Profile method) can be combined with one another or
with other protein prediction methods known in the art (see, for example, Eisen et al.,
"Cluster analysis and display of genome-wide expression patterns," Proc. Natl. Acad. Sci.
USA, 95: 14863-8 (1998)).

COMPUTER IMPLEMENTATION 

The various techniques, methods, and aspects of the invention described above can be
implemented in part or in whole using computer-based systems and methods. Additionally,
computer-based systems and methods can be used to augment or enhance the functionality
described above, increase the speed at which the functions can be performed, and provide
additional features and aspects as a part of or in addition to those of the invention described
elsewhere in this document. Various computer-based systems, methods and implementations
in accordance with the above-described technology are presented below.



The processor-based system can include a main memory, preferably random access memory
(RAM), and can also include a secondary memory. The secondary memory can include, for
example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive,
a magnetic tape drive, an optical disk drive, etc. The removable storage drive reads from
and/or writes to a removable storage medium. Removable storage media represent a floppy
disk, magnetic tape, optical disk, etc., which is read by and written to by the removable storage
drive. As will be appreciated, the removable storage media include a computer usable
storage medium having stored therein computer software and/or data.

In alternative embodiments, secondary memory may include other similar means for
allowing computer programs or other instructions to be loaded into a computer system. Such
means can include, for example, a removable storage unit and an interface. Examples of
such can include a program cartridge and cartridge interface (such as that found in video
game devices), a removable memory chip (such as an EPROM, or PROM) and associated
socket, and other removable storage units and interfaces which allow software and data to be
transferred from the removable storage unit to the computer system.

The computer system can also include a communications interface. Communications
interfaces allow software and data to be transferred between the computer system and external
devices. Examples of communications interfaces can include a modem, a network interface
(such as, for example, an Ethernet card), a communications port, a PCMCIA slot and card,
etc. Software and data transferred via a communications interface are in the form of signals
which can be electronic, electromagnetic, optical or other signals capable of being received
by a communications interface. These signals are provided to the communications interface via
a channel capable of carrying signals and can be implemented using a wireless medium, wire
or cable, fiber optics or other communications medium. Some examples of a channel can
include a phone line, a cellular phone link, an RF link, a network interface, and other
communications channels.

In this document, the terms "computer program medium" and "computer usable medium" are
used to generally refer to media such as a removable storage device, a disk capable of
installation in a disk drive, and signals on a channel. These computer program products are
means for providing software or program instructions to a computer system.



Computer programs (also called computer control logic) are stored in main memory and/or
secondary memory. Computer programs can also be received via a communications
interface. Such computer programs, when executed, enable the computer system to perform
the features of the present invention as discussed herein. In particular, the computer
programs, when executed, enable the processor to perform the features of the present
invention. Accordingly, such computer programs represent controllers of the computer
system.

In an embodiment where the elements are implemented using software, the software may be
stored in, or transmitted via, a computer program product and loaded into a computer system
using a removable storage drive, hard drive or communications interface. The control logic
(software), when executed by the processor, causes the processor to perform the functions of
the invention as described herein.

In another embodiment, the elements are implemented primarily in hardware using, for
example, hardware components such as PALs, application specific integrated circuits
(ASICs) or other hardware components. Implementation of a hardware state machine so as
to perform the functions described herein will be apparent to persons skilled in the relevant
art(s). In yet another embodiment, elements are implemented using a combination of both
hardware and software.

In another embodiment, the computer-based methods can be accessed or implemented over
the World Wide Web by providing access via a Web Page to the methods of the present
invention. Accordingly, the Web Page is identified by a Universal Resource Locator (URL).
The URL denotes both the server machine and the particular file or page on that machine. In
this embodiment, it is envisioned that a consumer or client computer system interacts with a
browser to select a particular URL, which in turn causes the browser to send a request for
that URL or page to the server identified in the URL. Typically the server responds to the
request by retrieving the requested page, and transmitting the data for that page back to the
requesting client computer system (the client/server interaction is typically performed in
accordance with the hypertext transport protocol ("HTTP")). The selected page is then
displayed to the user on the client's display screen. The client may then cause the server
containing a computer program of the present invention to launch an application, for example
to perform a Rosetta Stone analysis or Phylogenetic Profile analysis based on a query
sequence provided by the client.

The following examples are provided to illustrate the practice of the instant invention, and in 
no way limit the scope of the invention. 

EXAMPLES

Rosetta Stone method 

Some interacting proteins such as the Gyr A and Gyr B subunits of E. coli DNA gyrase are
fused into a single chain in another organism, in this case the topoisomerase II of yeast
(Berger et al., Nature 379, 225 (1996)). Thus, the sequence similarities of Gyr A (804 amino
acid residues) and Gyr B (875 residues) to different segments of the topoisomerase II (1429
residues) suggest by the Rosetta Stone method that Gyr A and Gyr B interact in E. coli.

To find other such putative protein interactions in E. coli, 3000 (of the total of 4290) protein
sequences of the E. coli genome (Blattner et al., Science 277, 1453 (1997)) were searched.
The triplets of proteins are found with the aid of protein domain databases such as the
ProDom or Pfam databases (Corpet et al., Nucleic Acids Res. 26, 323 (1998); Bateman et al.,
Nucleic Acids Res. 27, 260 (1999)). Here, a list of all ProDom domains in every one of the
64,568 SWISS-PROT proteins was prepared, as well as a list of all proteins that contain each
of the 53,597 ProDom domains. Then every protein in ProDom was considered for its ability
to be a linking (or Rosetta Stone) member in a triplet. All pairs of domains that are both
members of a given protein P were defined as being linked by protein P if at least one
protein with only one of the two domains could be found. By this method 14,899 links
between the 7*143 ProDom domains were found. Then in a single genome (such as E. coli)
all non-homologous pairs of proteins containing linked domains were found. These pairs are
linked by the Rosetta Stone protein. For E. coli, this method found 3531 protein pairs. An
alternate method for discovering protein triplets uses amino acid sequence alignment
techniques to find two proteins that align to a Rosetta Stone protein such that the alignments
do not overlap on the Rosetta Stone protein. For E. coli, this method found 4487 protein
pairs, 1209 of which were also found by the ProDom search method (even though different
sequence databases were searched for each method). In total, 6809 pairs of non-homologous
sequences were found in which both members of the pair have significant similarity to a
single protein in some other genome. That single protein was termed a Rosetta Stone
sequence because it was capable of deciphering the interaction between the protein pairs.
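The domain-based search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported counts: the input is assumed to be a plain mapping from protein name to its list of domain identifiers, "non-homologous" is approximated as "sharing no domain", and all names (find_domain_links, linked_protein_pairs, the D_gyrA and D_gyrB identifiers) are hypothetical.

```python
from itertools import combinations


def find_domain_links(protein_domains):
    """Map each linked domain pair to the set of proteins that link it.

    Two domains found together in a protein P are linked by P only if some
    protein in the database contains exactly one of the two domains.
    """
    domain_sets = {p: set(ds) for p, ds in protein_domains.items()}
    links = {}
    for protein, domains in domain_sets.items():
        for d1, d2 in combinations(sorted(domains), 2):
            separated = any(
                (d1 in other) != (d2 in other) for other in domain_sets.values()
            )
            if separated:
                links.setdefault(frozenset((d1, d2)), set()).add(protein)
    return links


def linked_protein_pairs(genome_proteins, links):
    """Return protein pairs from one genome whose domains are linked.

    Assumption: proteins sharing any domain are treated as homologous
    and are skipped.
    """
    pairs = set()
    for a, b in combinations(sorted(genome_proteins), 2):
        da, db = set(genome_proteins[a]), set(genome_proteins[b])
        if da & db:
            continue
        if any(frozenset((x, y)) in links for x in da for y in db):
            pairs.add((a, b))
    return pairs


# Toy database modeled on the gyrase example; identifiers are illustrative.
database = {
    "TOP2_YEAST": ["D_gyrB", "D_gyrA"],
    "GYRA_ECOLI": ["D_gyrA"],
    "GYRB_ECOLI": ["D_gyrB"],
}
ecoli = {"GYRA_ECOLI": ["D_gyrA"], "GYRB_ECOLI": ["D_gyrB"]}
print(linked_protein_pairs(ecoli, find_domain_links(database)))
```

In this toy case the yeast protein carries both domains while each E. coli protein carries only one, so the two E. coli proteins are reported as a candidate pair.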

Each of these 6809 pairs is a candidate for a pair of interacting proteins in E. coli. Five such
candidates are shown in FIG. 1. The first three pairs of E. coli proteins were among those
easily determined from the biochemical literature to interact in fact. The final two pairs of
proteins are not known to interact. They are representatives of many such pairs whose
putative interactions at this time must be taken as testable hypotheses.

Three independent tests of interactions predicted by the Rosetta Stone method were devised,
each showing that a reasonable fraction may in fact interact. The first method uses the
annotation of proteins given in the SWISS-PROT database. For cases where the interacting
proteins have both been annotated, we compare their annotations, looking for a similar
function for both members of the pair. Similar function would imply at least a functional
interaction. Of the 3950 E. coli pairs of known function, 2682 (68%) share at least one
keyword in their SWISS-PROT annotations (ignoring the keyword "hypothetical protein"),
suggesting related functional roles. When pairs of E. coli proteins are selected at random,
only 15% share a keyword. In short, of the E. coli pairs that the Rosetta Stone method turns
up as candidates for protein-protein interactions, more than half have both members with a
similar function; the method therefore seems to be a robust predictor of protein function.
Where the function of one member of a protein pair is known, the function of the other
member can be predicted. Performing a similar analysis in yeast turns up 45,502 protein
pairs. Of the S 857 pairs of known function, 32% share at least one keyword in their
annotations compared with 14% when proteins are selected at random.
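The keyword test above amounts to counting, among pairs in which both members carry usable annotation, the fraction that share at least one keyword. The sketch below assumes annotations are available as a mapping from protein name to a list of keyword strings; the function name and the exact-match treatment of the ignored keyword are illustrative assumptions.

```python
def fraction_sharing_keyword(pairs, keywords,
                             ignore=frozenset({"hypothetical protein"})):
    """Fraction of annotated pairs whose members share at least one keyword.

    Pairs in which either member has no keyword left after removing the
    ignored ones are not counted as testable.
    """
    testable = 0
    shared = 0
    for a, b in pairs:
        ka = set(keywords.get(a, ())) - ignore
        kb = set(keywords.get(b, ())) - ignore
        if not ka or not kb:
            continue
        testable += 1
        if ka & kb:
            shared += 1
    return shared / testable if testable else 0.0


# The reported E. coli figure follows from the counts given in the text.
print(round(100 * 2682 / 3950))  # 68
```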

The second test of the interactions predicted by the Rosetta Stone method uses as
confirmation the Database of Interacting Proteins (http://doe-mbi.ucla.edu). This is a
compilation of protein pairs that have been found to interact in some published experiment.
As of December 1998, the database contained 939 entries, 724 of which have both members
of the pair listed in the ProDom database. Of these 724 pairs, we find 46, or 6.4%, linked by
Rosetta Stone sequences. We expect this percentage to rise as more genomes are sequenced,
revealing more linked sequences.

The third test of Rosetta Stone predictions is by another computational method for predicting
interactions (Pellegrini et al., PNAS 96, 4285 (1999)), the method of phylogenetic profiles,
which detects functional interactions by correlated evolution of protein pairs. This method
was applied to the 6809 interactions predicted by the Rosetta Stone method for E. coli proteins.
Some 321 of these (~5%) were suggested by the phylogenetic profile method to interact, more
than eight times as many interactions in common as for randomly chosen sets of interactions.
Given that the Rosetta Stone method and the phylogenetic profile method rest on entirely
different assumptions, this level of overlap of predictions tends to support the predictive
power of both methods.

The recognition of many possible pair interactions between proteins of E. coli led to the
search for coupled interactions, where A is predicted to interact with B and B with C, and so
forth. That is, we examined whether the Rosetta Stone method can turn up complexes of
proteins or protein pathways. As FIG. 5 shows, suggestive information on
both pathways and complexes did emerge from linked pairs of E. coli proteins. FIG. 5A
represents the pathway for shikimate biosynthesis and FIG. 5C represents the pathway for
purine biosynthesis. The enzymes in these pathways for which links were found to other
members of the same pathway are shown in bold type. The precise links suggested by
Rosetta Stone sequences are shown in panels FIG. 5B and D. Some of these discovered links
are between sequential enzymes in the pathway, and others are between more distant
members, perhaps suggesting a multienzyme complex. An alternative explanation of the
same findings is that enzymes in the pathway are expressed in a fused form in some
organisms as an aid in regulation of expression; in this case linked members of a pair would
not necessarily bind to each other (see below).

To evaluate the reliability of Rosetta Stone predictions of protein interactions, it is helpful to
consider why the method should work in the first place. This emerges from considerations of
protein affinity. It follows from the laws of thermodynamics that the fusion of protein
domains A and B into a single protein chain can profoundly enhance the affinity of A for B.
The reason for this is that fusion greatly reduces the entropy of dissociation of A with B,
thereby reducing the association free energy of A to B. This reduction in entropy is often
expressed as an increase in the effective concentration of A with respect to B. The
concentrations of proteins in E. coli cells tend to be of the order of micromolar (Pederson et
al., Cell 14, 179 (1978)) whereas the effective concentrations of fused proteins can be ~mM or
even greater (Robinson et al., PNAS USA 95, 5929 (1998)). Put another way, the standard
free energy of dissociation of protein subunits from a complex is typically 8-20 kcal/mole at
27 °C (corresponding to dissociation constants of 10^-6 to 10^-14 M) (Horton and Lewis, Protein
Sci. 1, 169 (1992)), and can be reduced by ~10 kcal/mol when the subunits are fused into a
single protein chain. Because affinity between proteins A and B is greatly enhanced when A
is fused to B, some interacting pairs of proteins may have evolved from primordial proteins
that included the interacting domains A and B on the same polypeptide, as shown in FIG. 6.
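The correspondence between dissociation free energies and dissociation constants quoted above follows from the standard relation Kd = exp(-ΔG/RT). The short sketch below evaluates it at 27 °C; the gas constant value, the 1 M standard state, and the function name are assumptions introduced for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K), assumed value


def dissociation_constant(delta_g_kcal, temp_c=27.0):
    """Dissociation constant (mol/L, 1 M standard state) for a given
    standard free energy of dissociation in kcal/mol."""
    rt = R_KCAL * (temp_c + 273.15)
    return math.exp(-delta_g_kcal / rt)


for dg in (8.0, 20.0):
    print(f"{dg:5.1f} kcal/mol -> Kd = {dissociation_constant(dg):.1e} M")
```

With these assumptions, 8 kcal/mol gives a Kd of roughly 10^-6 M, and 20 kcal/mol gives a value within about an order of magnitude of the lower end of the quoted range.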

FIG. 6 shows a model for the evolution of protein-protein interactions. The Rosetta Stone
model starts with the fusion of the genes that code for the non-interacting domains A and B,
leading to expression of the fused two-domain protein AB (see Table II of J. S. Richardson,
Adv. Protein Chem. 34, 167 (1981)). Note that eukaryotic genes, in contrast to prokaryotic
genes, often code for multidomain proteins. In the fused protein, the domains have a
relatively high effective concentration, and relatively few mutations create a primitive
binding site between the domains that is optimized by successive mutations. In the second
line, the interacting domains are separated by recombination with another gene to create an
interacting pair of proteins A and B. An interacting pair of proteins A and B can also be created
by fission of a protein, so that the preliminary fusion step is not essential to the Rosetta Stone
hypothesis. The lower right-hand step shows another possible mutation, a loop deletion that
leads to a domain-swapped homodimer. This evolutionary path to homooligomers is the
analog for homooligomers of the evolutionary path suggested here for heterooligomers. This
pathway is termed the Rosetta Stone hypothesis for evolution of protein interactions. Also in
support of the Rosetta Stone pathway is the observation that protein-protein interfaces have
strong similarity to interdomain interfaces within single protein molecules (Tsai and
Nussinov, J. Mol. Biol. 260, 604 (1996)).

It is important to realize that the Rosetta Stone Method makes two distinct predictions. First,
it predicts protein pairs that have related biological function - that is, proteins that
participate in a common structural complex, metabolic pathway, or biological process.
Prediction of function is robust: for E. coli, general function similarity was observed in over
half the testable predictions. Second, the method predicts potential protein-protein
interactions. For this more specific prediction, the considerations of protein affinity and
evolution aid in understanding in which cases the Rosetta Stone method will miss pairs of
interacting proteins (false negatives) and in which cases it will turn up false candidates for
interacting pairs (false positives). One reason for missing interactions is that many
protein-protein interactions may have evolved through other mechanisms, such as gradual
accumulation of mutations to evolve a binding site. In these cases, there never was a fusion of
the interacting proteins, so no Rosetta Stone protein can be found. A second reason is that, even in
cases when the interaction partners were once fused, the fused protein may have disappeared
during the course of evolution, so there is no Rosetta Stone relic remaining to decipher
binding partnerships. As more genomes are sequenced, however, there is a higher chance of
finding Rosetta Stone proteins.

False predictions of physical interactions may be made by the Rosetta Stone method in cases
where domains are fused but not interacting. This may be so when proteins have been fused
to regulate coexpression or protein signaling. For these cases, the "interactions" of the
proteins can be functional interactions rather than physical interactions. Other false
predictions can arise because the Rosetta Stone method cannot distinguish between homologs
that bind and those that do not. As an example, consider the signaling domains SH2 and
SH3. The kinase domain and the SH2 and SH3 domains of the src homology kinase interact
with one another in the src molecule (Xu et al., Nature 385, 595 (1997); Sicheri et al., Nature
385, 602 (1997)), but homologs of these domains are found in many other proteins, and it is
certainly untrue that all SH2 domains interact with all SH3 domains. A similar problem crops
up with EGF and immunoglobulin domains. That is, although the Rosetta Stone method
gives a robust prediction of protein function of the form "A is functionally linked to B," only
a subset of these putative interactions represent physical interactions between proteins.

To quantify and reduce errors in predicting protein-protein interactions, the occurrence of
"promiscuous" domains such as SH3 that are present in many otherwise different proteins is
calculated. These domains can be identified and removed during domain fusion analysis
(i.e., the Rosetta Stone Method). In the ProDom database of domains, the number of other
domains that each domain could be linked to using the Rosetta Stone method was counted.
As shown in FIG. 7, about 95% of the domains are linked to only a few other domains. For
the 7872 domains in the ProDom domain database for which we can find Rosetta Stone links,
only about 5% are "promiscuous," making more than 25 links to other domains. By filtering
only 5% of all domains from our Rosetta Stone method, one can remove the majority of
falsely predicted interactions. When this type of filtering is applied to the 3531 Rosetta
Stone links of E. coli found with the ProDom analysis, the number is reduced to 749.
Although it lowers the number of predictions, this filtration step increases the likelihood that
predicted links represent true physical interactions by 47% over the unfiltered predictions.
Accordingly, the identification in a genome of many pairs of protein sequences A' and B'
that are both homologs to a single sequence AB in another genome suggests the possibility
that A' and B' are binding partners and provides functional information about A' and B'.
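The filtering step can be illustrated as follows. The sketch assumes the domain links are available as a list of (domain, domain) pairs and uses the threshold of more than 25 links stated above; the function name and the toy data are hypothetical.

```python
from collections import Counter


def filter_promiscuous(domain_links, max_links=25):
    """Remove links that involve a domain linked to more than max_links others.

    Returns the retained links and the set of domains judged promiscuous.
    """
    counts = Counter()
    for d1, d2 in domain_links:
        counts[d1] += 1
        counts[d2] += 1
    promiscuous = {d for d, n in counts.items() if n > max_links}
    kept = [
        (d1, d2)
        for d1, d2 in domain_links
        if d1 not in promiscuous and d2 not in promiscuous
    ]
    return kept, promiscuous


# Toy data: one domain linked to 30 partners, plus a single specific link.
links = [("SH3", f"dom{i}") for i in range(30)] + [("A", "B")]
kept, removed = filter_promiscuous(links)
print(kept, removed)
```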

Phylogenetic Profile Method 

We computed phylogenetic profiles for the 4290 proteins encoded by the genome of E. coli
by aligning each protein sequence, P_i, with the proteins from 16 other fully sequenced
genomes (listed at the web site of The Institute for Genomic Research) using the BLAST
algorithm. Proteins coded by the n-th genome are defined as including a homolog of P_i if one
of them aligns to P_i with a score that is deemed statistically significant.
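A phylogenetic profile of this kind is simply a string of bits, one per reference genome. The sketch below assumes that the best alignment score of a query protein against each genome has already been reduced to an E-value; the threshold of 1e-5 is an illustrative assumption, since the text specifies only that the score be deemed statistically significant.

```python
def phylogenetic_profile(best_evalues, genomes, threshold=1e-5):
    """Return a tuple of bits, one per genome, in the order given by genomes.

    A bit is 1 when the best hit in that genome has an E-value at or below
    the threshold, and 0 otherwise (including when no hit was found).
    """
    return tuple(
        1 if best_evalues.get(g, float("inf")) <= threshold else 0
        for g in genomes
    )


genomes = ["genome_1", "genome_2", "genome_3"]  # placeholder genome names
hits = {"genome_1": 1e-30, "genome_3": 0.5}
print(phylogenetic_profile(hits, genomes))
```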

To test whether proteins with similar phylogenetic profiles are functionally linked, the
phylogenetic profiles for two proteins that are known to participate in structural complexes,
the RL7 ribosome protein and the FlgL flagellar structural protein, and one known to
participate in a metabolic pathway, the HIS5 histidine biosynthetic protein, were examined.
As a first step, all other E. coli open reading frames with identical phylogenetic profiles, and
then those with profiles that differ by one bit, were identified. The results are shown in FIG.
8(a) RL7; (b) FlgL; and (c) HIS5. Homologs of ribosome protein RL7 are found in 10 of 11
eubacterial genomes, as well as in yeast, but not in archaeal genomes. In FIG. 8(a) we find
that more than half of the E. coli proteins with the RL7 phylogenetic profile, or profiles that
differ by one bit, have functions associated with the ribosome. Since none of these proteins
has significant amino acid sequence similarity to RL7, the functional relationships to the
ribosome, had they not been known already, could not be inferred by sequence comparisons.
This finding supports the idea that proteins with similar profiles are likely to belong to a
common group of functionally linked proteins. Several other proteins with these profiles
have no assigned function and are accordingly listed as hypothetical. The testable prediction
of the clustering of phylogenetic profiles is that these as yet uncharacterized proteins have
functions associated with the ribosome.

The comparisons of the phylogenetic profiles of flagellar proteins, reported in FIG. 8(b),
further support the idea that proteins with similar profiles are likely to be functionally linked.
Ten flagellar proteins share a common profile. Their homologs are found in a subset of five
bacterial genomes: Aquifex aeolicus, Borrelia burgdorferi, Bacillus subtilis, Helicobacter
pylori, and Mycobacterium tuberculosis. Other proteins that appear in neighboring clusters
(groups of proteins that share a common profile) include various flagellar proteins and cell
wall maintenance proteins. Flagellar and cell wall maintenance proteins may be
biochemically linked, since flagella are inserted through the cell wall. For example, the lytic
murein transglycosylase (MltD) has a phylogenetic profile that differs by only one bit from
that of the FlgL flagellar structural protein. This transglycosylase cuts the cell wall for
unknown reasons. Therefore another prediction is that this enzyme may participate in
flagellar assembly.

While FIGS. 8(a) and (b) include proteins in structural complexes, FIG. 8(c) shows proteins
involved in amino acid metabolism. It was found that more than half the proteins with
phylogenetic profiles similar (within one bit) to that of the HIS5 histidine synthesis protein
are involved in amino acid metabolism.

The examples of FIG. 8 show that proteins with similar phylogenetic profiles to a query
protein are likely to be functionally linked with it. The converse also holds: groups of
proteins known to be functionally linked often have similar phylogenetic profiles. In Table I,
groups of E. coli proteins were chosen that share a common keyword in their Swissprot
annotation, reflecting well known families of functionally linked proteins. Since homologous
proteins coded by the same genome necessarily have similar profiles, they were eliminated
from the groups. For each group, the number of protein pairs that are "neighbors" was
computed, where neighbors are defined as proteins whose profiles differ by less than 3 bits.
For a group of N proteins there are at most N(N-1)/2 possible neighbors.

Table I  Phylogenetic profiles link proteins with similar keywords

Keyword: Ribosome; Transcription; tRNA synthase & ligase; Membrane proteins; Flagellar;
Iron & Ferric & Ferritin; Galactose metabolism; Molybdate & Molybdenum & Molybdopterin;
Hypothetical

Number of proteins*: 60; 36; 26; 25; 21

Number of neighbors in keyword group†: 197; 173; 11; 89; 81; 19; 18; 12; 1084; 16; 31; 108226

Number of neighbors in random group‡: 27; 10; 1; 8440

*E. coli proteins grouped on the basis of a common keyword extracted from their annotation in the Swissprot
database. †Number of protein pairs, N_kw, in the keyword group with profiles that differ by less than 3 bits.
These pairs are termed neighbors. ‡Number of neighbors found on average for a random group of proteins of
the same size as the keyword group. Only membrane proteins without uniformly zero phylogenetic profiles
were included.

Proteins grouped on the basis of similar keywords in Swissprot have more similar
phylogenetic profiles than random proteins. Column 2 gives the number of non-homologous
proteins in the keyword group. Column 3 gives the number of protein pairs in the keyword
group with profiles that differ by less than 3 bits. These pairs are termed neighbors. Column
4 lists the number of neighbors found on average for a random group of proteins of the same
size as the keyword group. Only membrane proteins without uniformly zero phylogenetic
profiles were included. Unlike the other rows of the table, the hypothetical proteins do
contain homologous pairs.
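The neighbor count used in Table I can be expressed compactly. The following sketch assumes each profile is a tuple of 0/1 bits of equal length and counts pairs whose profiles differ by less than 3 bits, as defined above; the function names are illustrative.

```python
from itertools import combinations


def hamming(p, q):
    """Number of bit positions at which two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def count_neighbors(profiles, max_diff=3):
    """Number of profile pairs that differ by fewer than max_diff bits."""
    return sum(
        1 for p, q in combinations(profiles, 2) if hamming(p, q) < max_diff
    )


def max_pairs(n):
    """Upper bound N(N-1)/2 on the number of neighbor pairs in a group."""
    return n * (n - 1) // 2


group = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1)]
print(count_neighbors(group), "of", max_pairs(len(group)), "possible pairs")
```

For a group of 60 proteins, such as the ribosome group, the upper bound is 1770 pairs.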

The similarity of the phylogenetic profiles of the proteins that share a common keyword is
evaluated by a statistical test: the number of neighbors found in our keyword groups was
compared to the average number of neighbors found in a group of the same size but with
randomly selected E. coli proteins. We find that the random sets contain on average very few
neighbors compared to the keyword groups, even though the keyword groups contain only a
fraction of all possible neighbor pairs. Thus proteins that are functionally linked are far more
likely to be neighbors in profile space than randomly selected proteins. However, only a
fraction of all possible neighbors within a group were found. Therefore not all functionally
linked proteins have similar profiles; they may fall into multiple clusters in profile space. It is
interesting to note that hypothetical proteins are also more likely to be neighbors than random
proteins, suggesting that many hypothetical proteins are part of uncharacterized pathways or
complexes.
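The random-group control can be sketched as repeated sampling of groups of the same size from the full set of profiles. The number of trials, the seed, and the function names below are assumptions made for illustration; the text states only that an average over random groups was used.

```python
import random
from itertools import combinations


def count_neighbors(profiles, max_diff=3):
    """Number of profile pairs that differ by fewer than max_diff bits."""
    return sum(
        1
        for p, q in combinations(profiles, 2)
        if sum(a != b for a, b in zip(p, q)) < max_diff
    )


def mean_random_neighbors(all_profiles, group_size, trials=100, seed=0):
    """Average neighbor count over randomly drawn groups of a fixed size."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += count_neighbors(rng.sample(all_profiles, group_size))
    return total / trials
```

A keyword group would then be compared by calling count_neighbors on its own profiles and mean_random_neighbors on the full list of genome profiles with the same group size.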

A second indication that functionally linked proteins are likely to have similar phylogenetic
profiles comes from the analysis of classes of proteins obtained from the EcoCyc library
(Encyclopedia of E. coli genes and metabolism). Several classes that contain more than ten
members and represent well known biochemical pathways were selected. These results are
listed in Table II. The results are similar to those found with the
keyword groups: members of the group are far more likely to have neighboring profiles than
a randomly selected control group.



Table II  Phylogenetic profiles link proteins in EcoCyc classes

EcoCyc class                 Number of    Number of neighbors   Number of neighbors
                             proteins*    in EcoCyc class†      in random group‡
Carbon compounds             88           798                   60
Anaerobic respiration        66           275                   30
Aerobic respiration          28           39                    6
Electron transport           26           91                    5
Purine biosynthesis          21           11                    3
Salvage nucleosides          15           10                    1
Fermentation                 19           17                    3
TCA cycle                    16           6                     1
Glycolysis                   14           5                     1
Peptidoglycan biosynthesis   12           10                    1

*E. coli proteins grouped according to metabolic function on the basis of EcoCyc (Encyclopedia of E. coli
genes and metabolism) classes. †The number of protein pairs, N_EC, in the EcoCyc class with profiles that differ
by less than 3 bits. These pairs are termed neighbors. ‡Number of neighbors found on average for a random
group of proteins of the same size as the keyword group.

Proteins grouped according to metabolic function on the basis of EcoCyc classes have more
similar phylogenetic profiles than random proteins. Column 2 gives the number of proteins in
the EcoCyc class. Column 3 gives the number of protein pairs in the EcoCyc class with
profiles that differ by less than 3 bits. These pairs are termed neighbors. Column 4 lists the
number of neighbors found on average for a random group of proteins of the same size as the
keyword group.

The ability of the method to predict the function of uncharacterized proteins was tested. The
function of a protein was equated with that of its neighbors in phylogenetic profile space.
This is accomplished by means of the keyword annotations found within the Swissprot
database. To test how effective this method is, the keywords of each characterized protein
were compared to those of the neighbors in phylogenetic profile space. The neighbors, in
this case, were all other proteins with an identical profile or proteins with a vector
distance profile whose Euclidean distance was within 2 evolutionary units. It was found that
on average 43% of the neighbor keywords overlapped the known keywords of the query
protein. By comparison, random proteins had only a 4% overlap with the same set of
neighbors. Thus, a rough estimate was made that for more than half of E. coli proteins one
can correctly assign the general function by examining the functions of their phylogenetic
profile neighbors. This estimate should also hold for the ability of phylogenetic profiles to
assign functions to uncharacterized proteins.
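One way to read the overlap measure above is as the fraction of all keyword occurrences among a protein's neighbors that match a keyword of the query protein. The sketch below implements that reading; the exact scoring used to obtain the reported percentages may differ, and the function name and example keywords are illustrative.

```python
from collections import Counter


def keyword_overlap(query_keywords, neighbor_keyword_lists):
    """Fraction of neighbor keyword occurrences that match a query keyword."""
    counts = Counter(k for kws in neighbor_keyword_lists for k in kws)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[k] for k in set(query_keywords)) / total


query = ["Ribosomal protein"]
neighbors = [
    ["Ribosomal protein", "rRNA-binding"],
    ["Ribosomal protein"],
    ["Transferase"],
]
print(keyword_overlap(query, neighbors))
```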

As another example, the phylogenetic profiles for the 6217 proteins encoded by the genome
of the yeast Saccharomyces cerevisiae were computed, employing the same methods used for
E. coli proteins. As in E. coli, where the function of a protein was already known, one could test
the predicted function. In yeast, it was found that on average 29% of the neighbor keywords
overlapped the known keywords of the query protein, compared to 8% overlap for random
proteins.

The phylogenetic profile of a protein describes the presence or absence of homologs in 
organisms. Proteins that make up multimeric structural complexes are likely to have similar 
profiles. Also, proteins that are known to participate in a given biochemical pathway are 
likely to be neighbors in the space of phylogenetic profiles. This demonstrates that 
comparing profiles is a useful tool for identifying the complex or pathway that a protein 
participates in. The method of the invention is able to make functional assignments of 
uncharacterized proteins by examining the function of proteins with identical phylogenetic 
profiles. 

As the number of fully sequenced genomes increases, scientists will be able to construct
longer, and potentially more informative, protein phylogenetic profiles. There are at least 100
genome projects underway due for completion within the next few months. These data will
allow construction of profiles of length 100 rather than 16 bits. Because the number of
profile patterns grows exponentially with the number of fully sequenced genomes, the results
of 50 bit comparisons should be considerably more informative than those with 16 bits.
Furthermore, because the newly sequenced genomes will include several eukaryotic
organisms, protein phylogenetic profiles should also become a useful tool for studying
structural complexes and metabolic pathways in these higher organisms.

Combination Methods 

As discussed above, phylogenetic profiles allow proteins that are unrelated in sequence but
functionally related to be grouped together. A similar analysis can be performed by considering the
constraint that proteins that function together are usually present in the cell at the same time.
Such a method exploits the synchronous protein expression requirement by analyzing mRNA
expression patterns in yeast grown under a variety of conditions. In practice, proteins with
similar mRNA expression patterns are grouped, and such proteins often have similar
functions (see Eisen et al., Proc. Natl. Acad. Sci. USA 95, 14863-8 (1998)). In much the
same way, proteins could be clustered according to spatial expression patterns by analyzing
tissue- or cellular compartment-specific expression patterns. In addition, the Rosetta Stone
method can be used to predict functional interactions between different proteins in one
organism by virtue of their fusion into a single protein in another organism. The combination of these
three independent methods of prediction with available experimental data is presented here to
demonstrate the first large-scale prediction of protein function. These methods established
links between proteins of closely related function in the yeast Saccharomyces cerevisiae.

Experimental Interactions. Pairwise links were created between yeast proteins known from 
experimental literature to interact by such techniques as co-immunoprecipitation and yeast 
two-hybrid methods. We combined interaction data from the MIPS database and the 
Database of Interacting Proteins, a community-developed database of protein-protein 
interactions. 

Linking of Metabolic Pathway Neighbors. Yeast homologs of E. coli proteins were found
by BLAST homology searches. Pairwise links were defined between yeast proteins whose E.
coli homologs catalyze sequential reactions (or one reaction step further away) in metabolic
pathways, as defined in the EcoCyc database.

Calculation of Correlated Evolution. Phylogenetic profiles were constructed for each yeast
protein as described above.

Calculation of Correlated mRNA Expression. Results of 97 individual publicly-available
DNA chip yeast mRNA expression data sets were encoded as a string of 97 numbers
associated with each yeast open reading frame (ORF) that described how the mRNA
containing that open reading frame changed levels during normal growth, glucose starvation,
sporulation, and expression of mutant genes. This string is the analogue within one organism
of a phylogenetic profile. The mRNA levels for each of the 97 experiments were
normalized, and only genes that showed a 2 standard deviation change from the mean in at
least one experiment were accepted, thereby ignoring genes that showed no change in
expression levels for any experiment. ORFs with correlated expression patterns were
grouped together by calculating the 97-dimensional Euclidean distance that describes the
similarity in mRNA expression patterns. ORFs were considered linked if they were among
the 10 closest neighbors within a given distance cutoff, conditions that maximized the
overlap of ORF annotation between neighbors.
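The expression-based linking can be sketched as follows, using only the Python standard library. Several details are assumptions made for illustration: normalization is taken to be a z-score within each experiment, the distance cutoff value is arbitrary, and the function names are hypothetical. The text specifies only the 2 standard deviation filter, the Euclidean distance, and the limit of 10 closest neighbors within a cutoff.

```python
import math
from statistics import mean, pstdev


def zscore_by_experiment(expr):
    """Normalize each experiment (column) to zero mean and unit deviation."""
    orfs = list(expr)
    n_exp = len(next(iter(expr.values())))
    normalized = {orf: [] for orf in orfs}
    for j in range(n_exp):
        column = [expr[orf][j] for orf in orfs]
        mu, sd = mean(column), pstdev(column)
        for orf in orfs:
            normalized[orf].append((expr[orf][j] - mu) / sd if sd else 0.0)
    return normalized


def link_orfs(normalized, k=10, cutoff=1.0, n_sd=2.0):
    """Link ORFs that are among each other's k nearest neighbors in
    expression space and lie within the distance cutoff."""
    # Keep only ORFs with at least one change of n_sd deviations or more.
    kept = [o for o, v in normalized.items() if any(abs(x) >= n_sd for x in v)]
    links = set()
    for a in kept:
        dists = sorted(
            (math.dist(normalized[a], normalized[b]), b) for b in kept if b != a
        )
        for d, b in dists[:k]:
            if d <= cutoff:
                links.add(frozenset((a, b)))
    return links
```

With 97 experiments, each value list in the input mapping would hold 97 numbers; the sketch works for any common length.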

Calculation of Correlated Gene Fusion Events. Proteins were linked by Rosetta Stone 
patterns as described above as well as by calculating what could be called incomplete triangle 
relationships between proteins. Alignments were found with the program Psi-Blast. 

An analysis using these methods identified 20,749 protein-protein links from correlated
phylogenetic profiles, 26,013 links from correlated mRNA expression patterns, and 45,502
links from Rosetta Stone sequences. As shown in FIG. 9, these links were combined with an
additional 500 experimentally-derived protein-protein interactions from the Database of
Interacting Proteins and the MIPS yeast genome database (Mewes et al., Nucleic Acids Res.
26, 33-37 (1998)), and 2,391 links among yeast proteins that catalyze sequential reactions in
metabolic pathways.

Of the 93,750 total functional links found among 4,701 (77%) of the yeast proteins, 4,130
were defined to be of the 'highest confidence' (known to be correct by experimental
techniques or validated by 2 of the 3 prediction techniques); 19,521 others are defined as
'high confidence' (predicted by phylogenetic profiles), and the remainder were predicted by
either correlated gene fusion or correlated mRNA expression, but not both.
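The confidence tiers just described can be written as a small classification rule. In this sketch each link is represented by the set of sources that support it; the source labels and the function name are hypothetical, and links from metabolic pathway neighbors are not assigned to a tier because the text does not state how they were treated.

```python
PREDICTION_METHODS = {"phylogenetic_profile", "rosetta_stone", "mrna_expression"}


def confidence(sources):
    """Assign a confidence tier to a link from the set of supporting sources."""
    predictions = set(sources) & PREDICTION_METHODS
    if "experimental" in sources or len(predictions) >= 2:
        return "highest"
    if "phylogenetic_profile" in predictions:
        return "high"
    return "other"


print(confidence({"rosetta_stone", "mrna_expression"}))  # highest
print(confidence({"phylogenetic_profile"}))              # high
print(confidence({"rosetta_stone"}))                     # other
```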

The quality of the links was evaluated as follows. One assumes that if a protein, A', is linked
to a group of functionally related proteins, the shared functions of these other proteins
provide a clue to the general function of A'. Where the function of A' is already known,
one can test the predicted function. For this test, the standardized keyword annotation of the
Swiss-Prot database was used to systematically compare the known function of
all characterized yeast proteins to the function predicted by the methods of the invention. As
one example chosen from the many yeast proteins tested, the Swiss-Prot keywords for the
enzyme ADE1, which catalyzes the seventh step of de novo purine biosynthesis, are "Purine
Biosynthesis" and "Ligase". Based upon the frequencies with which keywords appear in the
annotation of proteins linked to ADE1, the general function of ADE1 is predicted to be
Purine biosynthesis (13.6%), Transferase (11.4%), Ligase (6.8%), and Lyase (13.6%).
The analysis is therefore used to predict the general biological process in which a protein, here
ADE1, participates, as well as to link the protein to many other proteins of closely related
function. The results of the systematic keyword analyses are listed in Table III, along with
confidence levels, data coverage, and comparisons to random trials. The links verified by
two independent prediction techniques predict protein function with the same reliability as
experimental interaction data and at over eight times the level of random trials.
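
As a minimal sketch of the keyword-frequency prediction described above, the keywords of the linked proteins may be tallied and ranked. The neighbor annotations below are invented for illustration and are not the actual ADE1 links. The sketch assumes that each percentage is a keyword's share of all keyword occurrences among the linked proteins, which the source does not state explicitly.

```python
from collections import Counter


def predict_keywords(neighbor_keywords, top=4):
    """Rank keywords by their share of all neighbor keyword occurrences."""
    counts = Counter(kw.lower() for kws in neighbor_keywords for kw in kws)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(kw, n / total) for kw, n in counts.most_common(top)]


# Hypothetical keyword lists for four proteins linked to a query protein.
neighbors = [
    ["Purine biosynthesis", "Ligase"],
    ["Purine biosynthesis", "Transferase"],
    ["Purine biosynthesis", "Lyase"],
    ["Transferase"],
]
prediction = predict_keywords(neighbors)
```

The highest-ranked keywords are then read as the predicted general function of the query protein.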
TABLE III  Prediction of function of yeast proteins:
data coverage and reliability of predictions

                                           # of      # of        Ability to     Ability in  Signal
                                           proteins  functional  predict known  random      to
                                                     links       function*      trials†     noise‡

Individual Prediction Techniques
  Experimental§                               484        500     33.2%          4.0%        8.3 x
  Metabolic pathway neighbors                 188      2,391     20.3%          4.5%        4.5 x
  Phylogenetic profiles                     1,976     20,749     33.1%          7.4%        4.5 x
  Rosetta stone method                      1,898     45,502     26.5%          7.7%        3.4 x
  Correlated mRNA expression                3,387     26,013     11.5%          6.9%        1.7 x

Combined Predictions
  Links made by ≥ 2 prediction techniques     683      1,249     55.6%          6.9%        8.1 x
  Highest Confidence Links                  1,223      4,130     40.9%          5.5%        7.4 x
  High Confidence Links                     1,930     19,521     30.8%          7.4%        4.2 x
  High and Highest Confidence Links         2,356     23,651     32.0%          6.8%        4.7 x
  All Links                                 4,701     93,750     20.7%          7.2%        2.9 x


*The predictive power of individual techniques and combinations of techniques was evaluated by automated
comparison of annotation keywords. By the methods listed, each protein is linked to one or more neighbor
proteins. For characterized proteins ("query" proteins), the mean recovery of known Swiss-Prot keyword
annotation by the keyword annotation of linked neighbors was calculated as:

\langle \text{keyword recovery} \rangle = \frac{1}{A} \sum_{i=1}^{A} \sum_{j=1}^{x} \frac{n_j}{N} \qquad (1)

where A is the number of annotated proteins, x is the number of query protein Swiss-Prot keywords, N is the
total number of neighbor protein Swiss-Prot keywords, and n_j is the number of times query protein keyword j
occurs in the neighbor protein annotation. Because functional annotations typically consist of multiple
keywords, both specific and general, even truly related proteins show only a partial keyword overlap (e.g.
approx. 35%).

†Mean recovery of Swiss-Prot keyword annotation for query proteins of known function by Swiss-Prot keyword
annotation of randomly-chosen linked neighbors, calculated as in Equation (1) for the same number of links as
exist for real links (averages of 10 trials).

‡Calculated as the ratio of known function recovered by real links to that recovered by random links.

§Experimentally-observed yeast protein-protein interactions contained in the DIP and MIPS (Mewes et al.
Nucleic Acids Res. 26:33-37 (1998)) databases.
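
The keyword recovery, random-trial baseline, and signal-to-noise ratio defined in the footnotes above may be sketched, under stated assumptions, as follows. All protein names and annotations are toy data. The helper names are hypothetical, and drawing the same number of random neighbors for each query protein is only one possible reading of "the same number of links as exist for real links".

```python
import random
from collections import Counter


def recovery(query_kws, neighbor_kws):
    """Sum over query keywords j of n_j / N for one query protein."""
    counts = Counter(kw for kws in neighbor_kws for kw in kws)
    total = sum(counts.values())  # N: all neighbor keyword occurrences
    if total == 0:
        return 0.0
    return sum(counts[kw] for kw in set(query_kws)) / total


def mean_recovery(annotations, links):
    """Average recovery over annotated query proteins that have links."""
    scores = []
    for query, neighbors in links.items():
        if query in annotations and neighbors:
            neigh = [annotations.get(n, []) for n in neighbors]
            scores.append(recovery(annotations[query], neigh))
    return sum(scores) / len(scores) if scores else 0.0


def random_links(links, proteins, rng):
    """Keep each query's link count but draw its neighbors at random."""
    return {
        q: rng.sample([p for p in proteins if p != q], len(ns))
        for q, ns in links.items()
    }


# Toy annotations and links (hypothetical).
annotations = {
    "A": ["purine biosynthesis", "ligase"],
    "B": ["purine biosynthesis", "transferase"],
    "C": ["purine biosynthesis", "lyase"],
    "D": ["dna repair"],
    "E": ["dna repair", "hydrolase"],
    "F": ["transport"],
}
links = {"A": ["B", "C"], "D": ["E"]}

real = mean_recovery(annotations, links)
rng = random.Random(0)
trials = [
    mean_recovery(annotations, random_links(links, sorted(annotations), rng))
    for _ in range(10)  # averages of 10 trials, as in the footnote
]
baseline = sum(trials) / len(trials)
signal_to_noise = real / baseline if baseline > 0 else float("inf")
```

For the toy data each query protein recovers half of its neighbors' keyword occurrences, so the real-link recovery is 0.5; the baseline depends on the random draws.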
These links provide a means to characterize proteins of unknown function. There are 2,557
uncharacterized proteins in yeast (Mewes et al. Nucleic Acids Res. 26:33-37 (1998)),
proteins not studied experimentally and with no strong homologs of known function. Of
these, 374, or 15%, can be assigned a general function from the high and highest confidence
functional links and 1,524, or 60%, can be assigned a general function using all links.

A specific example of the assignment of function is shown in Figure 10 for a protein (yeast
open reading frame YGR021W) from a highly conserved protein family of unknown
function. On the basis of the methods described here and the functional links they uncover,
this family can now be assigned a function related to mitochondrial protein synthesis. Two
of the functional partners of YGR021W are also proteins in conserved families of unknown
function: the gidA family and the C. elegans M02F4.4 family. These families too can now
be associated with mitochondrial (or bacterial) protein synthesis. The link to triose-
phosphate isomerase (FIG. 10) is particularly interesting in light of the human myopathy in
which a deficiency of this enzyme is correlated with grossly altered mitochondrial
structure (Bardosi et al., Acta Neuropathol. (Berl.) 79, 387-394 (1990)).

Two additional examples of links are given: those to the yeast prion Sup35 (Wickner, R.B.,
Science 264, 566-569 (1994)), and those to MSH6, the yeast homolog of human colon-cancer
related genes (Miyaki et al., Nature Genet. 17, 271-272 (1997)). In both cases, a
general function is already known, but the method of the invention also predicts novel
functional links. In particular, in Figure 11, the yeast prion Sup35, which acts as a translation
release factor in its non-prion state, is linked with many proteins involved in protein
synthesis, consistent with Sup35's primary role of interacting with the ribosome to release the
newly synthesized peptide chain (Kushnirov et al., Gene 66, 45-54 (1988); Stansfield et al.,
EMBO J. 14, 4365-4373 (1995)). Also linked to Sup35 are protein sorting and targeting
proteins, consistent with an accessory role in guiding nascent proteins to their final cellular
destinations. Sup35 shows both correlated evolution and correlated mRNA expression with
components of the CCT chaperonin system, a yeast chaperonin system believed to aid
folding of newly synthesized actin and microtubules.

Novel links are also established when we examine MSH6, a DNA mismatch repair protein
(Johnson et al., J. Biol. Chem. 271, 7285-7288 (1996)) whose human homologs, when
mutated, cause the majority of hereditary nonpolyposis colorectal cancers (reviewed in:
Lynch et al., Ann. N. Y. Acad. Sci. 833, 1-28 (1997)). MSH6 is homologous to several other
DNA mismatch repair proteins and, in Figure 12, is linked to the sequence-unrelated PMS1
DNA mismatch repair protein family, mutations of which, in humans, are also tied to
colorectal cancer (Papadopoulos et al., Science 263, 1625-1629 (1994)). MSH6 is in turn
linked, via its homolog MSH4, to the purine biosynthetic pathway by methylenetetrahydrofolate
dehydrogenase, to two RNA modification enzymes, and to an uncharacterized protein
family, which can now be investigated in light of DNA repair and the potential participation of
human homologs in cancer.

A number of embodiments of the invention have been described. Nevertheless, it will be 
understood that various modifications may be made without departing from the spirit and 
scope of the invention. 
WHAT IS CLAIMED IS:

1. A method for determining functional links between at least two polypeptides,
comprising:

(a) aligning a primary amino acid sequence of multiple distinct non-homologous
polypeptides to the primary amino acid sequences of a plurality of proteins;

(b) for any alignment found between the primary amino acid sequences of all
of such multiple distinct non-homologous polypeptides and the primary
amino acid sequence of at least one such protein, outputting an indication
identifying the at least one such protein as an indication of a functional
link between the multiple polypeptides;

(c) obtaining data, comprising a list of polypeptides from at least two
genomes;

(d) comparing the list of polypeptides from at least two genomes to form a
protein phylogenetic profile for each protein, wherein the protein
phylogenetic profile indicates the presence or absence of a polypeptide
belonging to a particular protein family in each of the at least two
genomes based on homology of the polypeptides; and

(e) grouping the list of polypeptides from a particular protein family based on
similar profiles, wherein a similar profile is indicative of a functional link
between the polypeptides;

(f) comparing the functional links identified in step (b) and step (e) or both to
functional links identified by patterns of correlated expression,
experimentally measured interactions, and functional relationships.

2. The method of claim 1, further comprising, displaying the functional links as
networks of related proteins comprising:

(g) placing all polypeptides in a diagram such that functionally linked proteins
are closer together than all other proteins; and

(h) identifying proteins that fall in a cluster in said diagram as a functionally
related group.
ABSTRACT

A computational method, system, and computer program are provided for inferring
functional links from genome sequences. One method is based on the observation that
some pairs of proteins A' and B' have homologs in another organism fused into a single
protein chain AB. A trans-genome comparison of sequences can reveal these AB
sequences, which are Rosetta Stone sequences because they decipher an interaction
between A' and B'. Another method compares the genomic sequence of two or more
organisms to create a phylogenetic profile for each protein indicating its presence or
absence across all the genomes. The profile provides information regarding functional
links between different families of proteins. In yet another method a combination of the
above two methods is used to predict functional links.

[FIG. 1B: schematic showing Protein A (genome i), Protein B (genome j), and Protein C (genome k).]
[FIG. 2A: flowchart.
  Input primary sequence of at least two distinct non-homologous polypeptides
  -> Align all primary sequences to individual proteins in protein sequence database
  -> Indicate no match, or output indication of matching "Rosetta Stone" proteins]
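
The search outlined in FIG. 2A may be sketched, by way of example only, as follows. The sketch assumes that alignment hits have already been computed by some sequence alignment tool and reduced to intervals on the database (subject) protein. The Hit type, the function names, and the toy hits are hypothetical. Requiring the two aligned regions not to overlap is an assumption borrowed from FIG. 2B, and non-homology of the two query polypeptides is assumed to have been checked beforehand.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str    # query polypeptide, e.g. A' or B'
    subject: str  # database protein it aligns to
    start: int    # first aligned residue on the subject (1-based, inclusive)
    end: int      # last aligned residue on the subject (inclusive)


def overlaps(h1, h2):
    """True if two hits cover a common residue of the same subject."""
    return h1.start <= h2.end and h2.start <= h1.end


def rosetta_stone_candidates(hits, a, b):
    """Subjects to which both a and b align at non-overlapping regions."""
    by_subject = {}
    for h in hits:
        if h.query in (a, b) and h.subject not in (a, b):
            by_subject.setdefault(h.subject, []).append(h)
    found = []
    for subject, hs in by_subject.items():
        a_hits = [h for h in hs if h.query == a]
        b_hits = [h for h in hs if h.query == b]
        if any(not overlaps(x, y) for x in a_hits for y in b_hits):
            found.append(subject)
    return sorted(found)


# Toy hits (hypothetical): only AB is matched by both queries without overlap.
hits = [
    Hit("A'", "AB", 1, 200),
    Hit("B'", "AB", 230, 480),
    Hit("A'", "X", 10, 150),
    Hit("B'", "Y", 5, 90),
    Hit("A'", "Z", 1, 100),
    Hit("B'", "Z", 50, 160),
]
candidates = rosetta_stone_candidates(hits, "A'", "B'")
```

An empty result corresponds to the "indicate no match" branch of the flowchart.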
[FIG. 2B: flowchart.
  Input primary sequence of Rosetta Stone protein
  -> Align primary sequence to individual proteins in protein sequence database
  -> Output indication of distinct proteins containing non-overlapping primary
     sequences of Rosetta Stone protein]
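
The reverse search of FIG. 2B may likewise be sketched as follows. The sketch assumes that each database protein has been reduced to the single interval of the Rosetta Stone protein that it matches. The function name and the toy intervals are hypothetical.

```python
from itertools import combinations


def non_overlapping_partners(regions):
    """Pairs of distinct proteins matching disjoint segments of one protein.

    regions maps a protein name to the (start, end) interval, inclusive,
    that it covers on the Rosetta Stone protein.
    """
    pairs = []
    for (p, (ps, pe)), (q, (qs, qe)) in combinations(sorted(regions.items()), 2):
        if pe < qs or qe < ps:  # the two intervals share no residue
            pairs.append((p, q))
    return pairs


# Toy intervals (hypothetical) on a 480-residue Rosetta Stone protein.
regions = {"A'": (1, 200), "B'": (230, 480), "C'": (150, 260)}
partners = non_overlapping_partners(regions)
```

Each returned pair is a candidate functional link between two distinct proteins.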




[Figure: construction of protein phylogenetic profiles.
  Genomes: E. coli (EC), S. cerevisiae (SC), B. subtilis (BS), H. influenzae (HI).
  Genomic profiles: presence (1) or absence (0) of each E. coli protein P1 to P7 in SC, BS, and HI.
  Profile clusters: proteins grouped by identical profiles.
  Conclusion: P2 and P7 are functionally linked; P3 and P6 are functionally linked.]
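
The grouping illustrated in the figure above may be sketched as follows. The bit values are illustrative assumptions chosen only so that the stated conclusion is reproduced; they are not asserted to be the values drawn in the figure. The function name is hypothetical.

```python
from collections import defaultdict

# Presence (1) or absence (0) of each protein in (SC, BS, HI).
# Assumed values for illustration only.
profiles = {
    "P1": (1, 0, 1),
    "P2": (1, 1, 0),
    "P3": (0, 1, 1),
    "P4": (1, 0, 0),
    "P5": (1, 1, 1),
    "P6": (0, 1, 1),
    "P7": (1, 1, 0),
}


def cluster_by_profile(profiles):
    """Group proteins whose phylogenetic profiles are identical."""
    clusters = defaultdict(list)
    for protein, bits in profiles.items():
        clusters[tuple(bits)].append(protein)
    # Only clusters with two or more members suggest a functional link.
    return [sorted(members) for members in clusters.values() if len(members) > 1]


linked = sorted(cluster_by_profile(profiles))
```

With these assumed bits, the clusters are P2 with P7 and P3 with P6.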
[Figure: flowchart.
  Obtaining data representing a list of proteins from at least two organisms
  -> Comparing the list of proteins to form a protein phylogenetic profile
  -> Grouping the list of proteins based on similar profiles
  -> Output indication of proteins having similar profiles]




[Figure: flowchart.
  Obtaining data representing a list of proteins from at least two organisms
  -> Aligning the sequences of the list of proteins from at least two organisms
  -> Calculating the evolutionary distance of the at least two aligned sequences
  -> Do the sequences meet the evolutionary distance threshold?
     No: indicate no match
     Yes: output indication of proteins meeting evolutionary threshold value]
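
The threshold test in the flowchart above may be sketched as follows. The flowchart does not define the distance measure, so this sketch substitutes the simple fraction of mismatched residues over gap-free aligned columns. That measure, the threshold value, the direction of the comparison, and the function names are all assumptions made for illustration.

```python
def p_distance(aligned_a, aligned_b):
    """Fraction of mismatches over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free aligned positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def meets_threshold(aligned_a, aligned_b, max_distance=0.5):
    """Assumed rule: a pair matches when its distance is at most max_distance."""
    return p_distance(aligned_a, aligned_b) <= max_distance


# Toy alignment (hypothetical): one mismatch in five gap-free columns.
distance = p_distance("MKT-LV", "MRTALV")
match = meets_threshold("MKT-LV", "MRTALV")
```

Pairs that fail the test correspond to the "indicate no match" branch.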
[Figure panel: proteins grouped under the labels "Initial Profile" and "One bit different".]

1877 PgsA phospholipid synthesis
2895 YGGH hypothetical

3885 RL7 ribosome L7
6224 RL15 ribosome L15
3217 RL17 ribosome L17
1177 PTH peptidyl-tRNA hydrolase
5518 RNC ribonuclease III

0648 YBEX hypothetical
3624 RL34 ribosome L34
3222 RL36 ribosome L36
3115 RL27 ribosome L27
3097 RS15 ribosome S15
2731 YQCB hypothetical
0058 YABO hypothetical
1059 YCEC hypothetical
0229 RFH peptide release factor
2539 ClpB heat shock protein

1407 YJFH hypothetical
3230 RS14 ribosome S14

1387 G3P3 dehydrogenase

3242 RL4 ribosome L4
1945 NONE hypothetical

2561 GrpE co-chaperone

3661 GidB glucose inhib. division
3232 RL24 ribosome L24
3210 DEF polypeptide deformylase
1684 RL20 ribosome L20
0188 MesJ cell cycle protein
2553 RL19 ribosome L19
3116 RL21 ribosome L21
4094 RL9 ribosome L9
2567 SmpB small protein B



[Figure panel: proteins grouped under the labels "Initial Profile" and "One bit different".]

3132 RP54 sigma factor

1174 DHAR operon regulation
3345 RtcR transcription regulator, binds sigma factor RP54
0205 MltD lytic murein transglycosylase

1915 FliR flag biosynth
1914 FliQ flag biosynth
1911 FliN motor
1858 MotA motility
1056 FlgL flag hook
1051 FlgG flag hook/basal
1050 FlgF flag hook/basal
1049 FlgE flag hook/basal
1047 FlgC flag basal body
1055 FlgK flag hook

0624 RodA rod shape det.
0089 FtsW cell division
1163 Alr2 alanine racemase
1070 YCEG hypothetical

4060 AmiB Nam-Ala amidase
3948 Alr1 alanine racemase
2890 YGGW hypothetical

1889 FliD flag hook

1910 FliM motor
1046 FlgB flag basal



[Figure panel: proteins grouped under the labels "Initial Profile" and "One bit different".]

1982 His5 His synth
1233 TrpC Trp synth
2755 ArgA Arg synth
3291 CysG Cys synth
0728 YBGR hypoth

1623 LHR helicase
3050 Thd2 Thr catabolism
3353 Glc2 glycogen synthesis
3553 RfaG LPS synthesis
3822 YUP hypothetical
3866 ArgB Arg synthesis

1202 AdhE alcohol dehydrogenase
1358 MaoC monoamine metabolism
3006 OAT ornithine aminotransferase

1153 NONE hypothetical
2546 PheA Phe synthesis

1983 His4 His synthesis

1978 His1 His synthesis
3142 GltB Glu synthesis

0078 IlvH Val / Ile synthesis
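
The "One bit different" grouping shown in the panels above may be sketched as a search for profiles within a Hamming distance of one from an initial profile. The bit strings below are hypothetical and are not the profiles of the named proteins. The function names are likewise assumptions.

```python
def hamming(p, q):
    """Number of genomes in which two profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def profile_neighbors(query, profiles, max_bits=1):
    """Split other proteins into identical and near-identical profiles."""
    ref = profiles[query]
    same, near = [], []
    for name, prof in profiles.items():
        if name == query:
            continue
        d = hamming(ref, prof)
        if d == 0:
            same.append(name)
        elif d <= max_bits:
            near.append(name)
    return same, near


# Hypothetical four-genome profiles (1 = homolog present, 0 = absent).
profiles = {"RL7": "1101", "RL15": "1101", "PTH": "1100", "ClpB": "0010"}
same, near = profile_neighbors("RL7", profiles)
```

Proteins in the first list share the initial profile; those in the second differ from it by one bit.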




Experimental Data: link proteins known to interact by experiment.

Related Metabolic Function: link proteins whose homologs are known from experiment to
operate sequentially in metabolic pathways.

Related Phylogenetic Profiles: link proteins that evolved in a correlated fashion in the 20
organisms with fully-sequenced genomes.

Rosetta Stone Method: link proteins whose homologs are fused into a single gene in
another organism.

Correlated mRNA Expression: link proteins whose mRNA levels are correlated across 97
assays of yeast mRNA levels.

Predict function of uncharacterized proteins using links with characterized proteins.



YGR021W: member of highly conserved protein family of unknown function. Linked proteins:

MRPL2, MRPL6, MRPL7, MRPL10, MRPL16, MRPL23, MRPS9, MRPS28 (known ribosomal proteins)
YCR083W homology to thioredoxin
MRF1 peptide chain release factor
YJR113C homology to ribosomal protein S7
MSY1 tyrosyl-tRNA synthetase
YGL068W probable ribosomal protein L12
MGE1 heat shock protein/chaperone
YDR116C homology to bacterial ribosomal L1 protein
YHR189W homology to peptidyl-tRNA hydrolase
SIS1/XDJ1 homology to DnaJ heat shock protein
PDR13/SSE1/LHS1 homology to Hsp70
RIB2 DRAP deaminase
YDL036C homology to RIB2 / pseudouridine synthase
MIS1 C1-THF synthase
ADE3 C1-THF synthase
TPI1 triose phosphate isomerase
YGL236C homology to conserved gidA family, unknown function
YOL060C homology to hypothetical C. elegans protein M02F4.4

Group labels in the figure: known ribosomal proteins; predicted to target mitochondria;
protein synthesis.



Sup35: GTP-binding peptide chain release factor and prion. Linked proteins:

4 subunits of RNA polymerase II, and III
DIM1 rRNA dimethyltransferase
8 ATP-dependent RNA helicases
snRNP-specific elongation factor 2
8 amino acid-tRNA synthetases
6 ribosomal proteins
Sup45 peptide chain release factor
3 translation initiation factors
6 subunits of CCT chaperonin
MAP1 Met-amino peptidase
5 protein sorting and targeting proteins
2 DNA replication factor C subunits
Glutathione reductase
QRI7 and YKR038C peptidases
YOR11W ATP-dependent permease
TOP3 DNA topoisomerase III

Group labels in the figure: tRNA & mRNA synthesis; ribosome biogenesis; protein synthesis
and folding; protein targeting.


YDR097C MSH6: DNA mismatch repair. Linked proteins:

YHR120W MSH1 DNA mismatch repair, mitoch.
YOL090W MSH2 DNA mismatch repair
YCR092C MSH3 DNA mismatch repair
YFL003C MSH4 meiosis-specific protein
YDL154W MSH5 meiosis-specific protein
YKR080W MTD1 CH2-THF dehydrogenase
YML080W, YLR405W members of uncharacterized protein family
YOR274W MOD5 tRNA isopentenyltransferase
YER168C CCA1 tRNA nucleotidyltransferase
YMR167W MLH1 DNA mismatch repair
YNL082W PMS1 DNA mismatch repair
YLR035C homology to human mutL
YPL164C homology to mismatch repair Mlh1p